
Support fp16 scale tolerance #829

Closed
Quentin-Anthony opened this issue Mar 13, 2023 · 6 comments
Labels
feature request New feature or request help wanted This issue needs assistance

Comments

@Quentin-Anthony
Member

Fairseq supports an "fp16 scale tolerance", which allows a percentage of updates to overflow before the loss scale is reduced, in case the loss has merely spiked on a difficult sample rather than truly exploded. E.g. fp16_scale_tolerance=0.25 would allow one out of every four updates to overflow before lowering the loss scale.

https://github.com/facebookresearch/fairseq/blob/0338cdc3094ca7d29ff4d36d64791f7b4e4b5e6e/fairseq/dataclass/configs.py#L172 and https://github.com/facebookresearch/fairseq/blob/0338cdc3094ca7d29ff4d36d64791f7b4e4b5e6e/fairseq/optim/dynamic_loss_scaler.py#L52
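For reference, the tolerance check in the linked fairseq scaler boils down to something like the following (a simplified sketch; the names roughly follow fairseq's DynamicLossScaler but aren't verbatim):

```python
class DynamicLossScaler:
    """Simplified sketch of fairseq-style loss scaling with a scale tolerance."""

    def __init__(self, init_scale=2.0**15, scale_factor=2.0, tolerance=0.25):
        self.loss_scale = init_scale
        self.scale_factor = scale_factor
        self.tolerance = tolerance            # e.g. 0.25 -> tolerate ~1 overflow per 4 updates
        self._iter = 0
        self._last_rescale_iter = -1
        self._overflows_since_rescale = 0

    def update(self, overflow: bool):
        if overflow:
            iters_since_rescale = self._iter - self._last_rescale_iter
            self._overflows_since_rescale += 1
            pct_overflow = self._overflows_since_rescale / float(iters_since_rescale)
            # Only lower the loss scale once the recent overflow rate exceeds the tolerance.
            if pct_overflow >= self.tolerance:
                self.loss_scale /= self.scale_factor
                self._last_rescale_iter = self._iter
                self._overflows_since_rescale = 0
        self._iter += 1
```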

We should add support for this for runs that tend to be unstable. It will probably require implementation at the DeepSpeed/DeeperSpeed layer, though.

@Quentin-Anthony Quentin-Anthony added feature request New feature or request good first issue Good for newcomers labels Mar 13, 2023
@StellaAthena
Member

My inclination is that anything that requires touching DeepSpeed probably isn't a "good first issue."

Otherwise I think this is a very good idea. I think @crowsonkb has something like this running in her codebases actually, and may be able to port some of it over.

@Quentin-Anthony Quentin-Anthony removed the good first issue Good for newcomers label Mar 13, 2023
@Quentin-Anthony
Member Author

Noted! Removing the good first issue tag.

@crowsonkb -- Can you link to your implementation of this feature? It'd be great to use any tricks you found as a reference.

For whoever picks this up, the fairseq code should also be pretty easy to port to DeepSpeed.

@Quentin-Anthony Quentin-Anthony added the help wanted This issue needs assistance label Mar 27, 2023
@convexstrictly

Right now, DeepSpeed's optimizer-wrapper DeepSpeedZeroOptimizer uses DynamicLossScaler which it creates internally. I don't see any obvious way to tell it to use a custom loss scaler. What was the thinking on the implementation strategy for this feature?

I assume everyone is aware that DynamicLossScaler provides related functionality through the hysteresis flag, which delays downscaling of the loss scale until a specified number of consecutive updates have overflowed. Not quite the same as scale tolerance, though.
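For anyone comparing the two, hysteresis behaves roughly like this (an illustrative sketch, not DeepSpeed's actual code): the scale is only lowered once a fixed budget of overflows has been used up, rather than on the first overflow.

```python
class HysteresisLossScaler:
    """Rough sketch of hysteresis-style loss scaling (illustrative, not DeepSpeed's code)."""

    def __init__(self, init_scale=2.0**15, scale_factor=2.0, hysteresis=2):
        self.loss_scale = init_scale
        self.scale_factor = scale_factor
        self.hysteresis = hysteresis          # overflows to absorb before downscaling
        self._cur_hysteresis = hysteresis

    def update(self, overflow: bool):
        if overflow:
            if self._cur_hysteresis > 1:
                self._cur_hysteresis -= 1     # absorb the overflow without touching the scale
            else:
                self.loss_scale /= self.scale_factor
                self._cur_hysteresis = self.hysteresis
```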

@Quentin-Anthony
Member Author

Right now, DeepSpeed's optimizer-wrapper DeepSpeedZeroOptimizer uses DynamicLossScaler, which it creates internally. I don't see any obvious way to tell it to use a custom loss scaler. What was the thinking on the implementation strategy for this feature?

I assume everyone is aware that DynamicLossScaler provides related functionality through the hysteresis flag, which delays downscaling of the loss scale until a specified number of consecutive updates have overflowed. Not quite the same as scale tolerance, though.

The idea is to have a replenishing hysteresis resource for runs that periodically face instabilities. While looking through the DeepSpeed code to provide an outline of what I'd like to see, I stumbled across this dead code for consecutive_hysteresis, which I exposed to users in microsoft/DeepSpeed#3553. I'm going to rebase my PR to DeeperSpeed as well so that we can use it while the DeepSpeed team reviews it.
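To make the difference concrete, the replenishing behavior amounts to refilling the hysteresis budget after every clean step, along these lines (an illustrative sketch extending the one above, not the exact DeepSpeed consecutive_hysteresis code):

```python
class ReplenishingHysteresisLossScaler(HysteresisLossScaler):
    """Every non-overflowing step refills the hysteresis budget, so isolated spikes
    never exhaust it and the loss scale only drops on a genuine run of overflows."""

    def update(self, overflow: bool):
        if not overflow:
            self._cur_hysteresis = self.hysteresis   # replenish after a clean step
            return
        super().update(overflow)
```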

@Quentin-Anthony
Member Author

Closing this since it was added to DeeperSpeed in microsoft/DeepSpeed@f2acc07

@convexstrictly

Great that this functionality is going into DeepSpeed as well! Definitely should be more robust than the consecutive overflow strategy.

I hadn't realized that the ostensible dependencies on DeepSpeed were actually on DeeperSpeed (I should have looked at the requirements.txt), and that pushing the updates to DeepSpeed was an option. It makes a lot of sense.
