Model does not train without FP16 #322

Closed
kipgparker opened this issue May 11, 2021 · 1 comment
Labels
bug Something isn't working

Comments

@kipgparker (Contributor)

Currently the model does not train without FP16.

If you set `fp16["enabled"] = false` in the `small.yml` config and run `pretrain_gpt2.py`, it crashes with the following error:

  File "pretrain_gpt2.py", line 26, in <module>
    iteration = train(
  File "/home/aleph/repos/neox/gpt-neox/megatron/training.py", line 461, in train
    overflow_monitor.check(skipped_iter)  # check for repeated overflow
  File "/home/aleph/repos/neox/gpt-neox/megatron/utils.py", line 366, in check
    pretrain(neox_args=neox_args)
  File "/home/aleph/repos/neox/gpt-neox/megatron/training.py", line 96, in pretrain
    if self.optimizer.overflow and len(self.history) == self.n and all(self.history):
AttributeError: 'FusedAdam' object has no attribute 'overflow'```

Running on 2x 3090s.
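
For context, the failing check reads `self.optimizer.overflow` directly. That attribute only exists on DeepSpeed's FP16 optimizer wrapper, so once fp16 is disabled the bare `FusedAdam` is passed in and the lookup raises. Below is a minimal sketch of a more defensive overflow monitor; only the names `optimizer`, `history`, `n`, and `check` come from the traceback above, and the class shape, constructor, and error message are assumptions for illustration, not the repo's actual code or fix:

```python
from collections import deque


class OverflowMonitor:
    """Sketch of an overflow monitor that tolerates non-FP16 optimizers."""

    def __init__(self, optimizer, n=12):
        self.optimizer = optimizer
        self.n = n
        self.history = deque(maxlen=n)

    def check(self, skipped_iter):
        self.history.append(skipped_iter)
        # Only DeepSpeed's FP16 optimizer wrappers expose `overflow`;
        # a bare FusedAdam (fp16 disabled) does not, hence the AttributeError.
        overflow = getattr(self.optimizer, "overflow", False)
        if overflow and len(self.history) == self.n and all(self.history):
            raise Exception(
                f"Skipped {self.n} iterations in a row due to overflow"
            )
```
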
kipgparker added the bug label on May 11, 2021
@sdtblck (Contributor) commented May 11, 2021

@kipgparker that's intended behaviour for now; DeepSpeed in general does not work without fp16.
My bf16 branch is almost ready to merge, though.

sdtblck closed this as completed on Jun 16, 2021