Take ZeRO 3 for a test drive #171

Closed
StellaAthena opened this issue Mar 8, 2021 · 6 comments · Fixed by #172

@StellaAthena
Member

Is your feature request related to a problem? Please describe.
Model too smol

Describe the solution you'd like
DeepSpeed ZeRO-3 is finally public! Let’s take it for a test drive (remember to turn pipeline parallelism off) and see if we can get it to run.

Describe alternatives you've considered
Use Pipeline Parallelism

Additional context
It probably doesn’t work or needs to be modded to work because DeepSpeed :works-internally:
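
For reference, a minimal sketch of the kind of DeepSpeed config this calls for. The values below are illustrative assumptions, not our actual settings, and pipeline parallelism would be disabled separately on the NeoX side:

```python
# Hedged sketch of a DeepSpeed config dict enabling ZeRO stage 3; the
# equivalent JSON goes in the DeepSpeed config file. Batch size and fp16
# settings are placeholder assumptions.
ds_config = {
    "train_batch_size": 256,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
    },
}
```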

@StellaAthena StellaAthena added feature request New feature or request experiments Experiments we wish to perform on the codebase labels Mar 8, 2021
@StellaAthena StellaAthena self-assigned this Mar 8, 2021
@StellaAthena StellaAthena added this to To do in 1T or BUST via automation Mar 8, 2021
@StellaAthena StellaAthena moved this from To do to In progress in 1T or BUST Mar 9, 2021
@StellaAthena
Member Author

Code is implemented, running into some compatibility issues due to our mods. Shouldn’t be hard to resolve.

@StellaAthena
Member Author

Note: Do not use scientific notation in the config file. The config monster parses it as a string instead of a number (cc: @joshlk)
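
A minimal workaround sketch, assuming the parsed config ends up as a nested dict with the offending values stored as strings (this is not our actual config loader, just an illustration):

```python
# Hedged sketch: walk the parsed config and coerce strings that look like
# numbers (e.g. "1e-5") back into floats, so a string-typed learning rate
# never reaches the optimizer.
def coerce_numeric_strings(cfg):
    if isinstance(cfg, dict):
        return {k: coerce_numeric_strings(v) for k, v in cfg.items()}
    if isinstance(cfg, str):
        try:
            return float(cfg)
        except ValueError:
            return cfg
    return cfg
```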

Current error is

10.141.246.144: Traceback (most recent call last):
10.141.246.144:   File "pretrain_gpt2.py", line 193, in <module>
10.141.246.144:     pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
10.141.246.144:   File "/home/mchorse/gpt-neox/megatron/training.py", line 92, in pretrain
10.141.246.144:     model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
10.141.246.144:   File "/home/mchorse/gpt-neox/megatron/training.py", line 282, in setup_model_and_optimizer
10.141.246.144:     model, optimizer, _, lr_scheduler = deepspeed.initialize(
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/__init__.py", line 112, in initialize
10.141.246.144:     engine = DeepSpeedEngine(args=args,
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 183, in __init__
10.141.246.144:     self._configure_optimizer(optimizer, model_parameters)
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 612, in _configure_optimizer
10.141.246.144:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 762, in _configure_zero_optimizer
10.141.246.144:     optimizer = FP16_DeepSpeedZeroOptimizer_Stage3(
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 726, in __init__
10.141.246.144:     self.initialize_optimizer_states()
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1221, in initialize_optimizer_states
10.141.246.144:     self._optimizer_step(i)
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1186, in _optimizer_step
10.141.246.144:     self.optimizer.step()
10.141.246.144:   File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
10.141.246.144:     return func(*args, **kwargs)
10.141.246.144:   File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step
10.141.246.144:     multi_tensor_applier(self.multi_tensor_adam,
10.141.246.144:   File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
10.141.246.144:     return op(self.chunk_size,
10.141.246.144: RuntimeError: expected input to be on cuda

I thought this was connected to the line with `deepspeed.zero.Init(data_parallel_group=mpu.get_data_parallel_group(), remote_device="cpu"):` in pretrain_gpt2.py, but taking that out doesn't solve the issue.
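
For context, this is roughly how that context manager is meant to wrap model construction. A minimal sketch, assuming torch.distributed and the Megatron mpu are already initialized; the small nn.Linear stands in for the real model provider:

```python
import torch
import deepspeed
from megatron import mpu

# Sketch mirroring the call quoted above: deepspeed.zero.Init partitions
# parameters as they are constructed and keeps them on the remote device
# ("cpu" here). The nn.Linear is a stand-in for the real GPT-2 model builder.
with deepspeed.zero.Init(data_parallel_group=mpu.get_data_parallel_group(),
                         remote_device="cpu"):
    model = torch.nn.Linear(1024, 1024)
```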

@StellaAthena
Member Author

Updates from further testing indicate that this is a Pipeline Parallelism problem:

  • Setting PP = 0, ZeRO = 3 gives RuntimeError: expected input to be on cuda
  • Setting PP = 0, ZeRO = 1 gives AttributeError: 'tuple' object has no attribute 'float'
  • Setting PP = 1 runs, regardless of ZeRO

@StellaAthena
Member Author

Okay, so the problem with turning PP off is that this line returns (tensor, None). It does this regardless of the ZeRO settings. In ZeRO 1, if I add [0] to the end, it trains but cannot log. In ZeRO 3, it breaks with a complicated error signature.
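
A quick illustration of why the bare tuple breaks both paths, and what the [0] hack amounts to (a sketch, not the actual training loop):

```python
import torch

# With PP off, the forward pass hands back (tensor, None) instead of a bare
# tensor, so DeepSpeed's loss scaler ends up calling .float() on a tuple:
loss = (torch.tensor(1.0, requires_grad=True), None)
# loss.float()  # AttributeError: 'tuple' object has no attribute 'float'

# The "[0]" workaround is just defensive unpacking before model.backward(loss):
if isinstance(loss, tuple):
    loss = loss[0]
```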

Comparing to the MSFT repository, I see that they just always use weight tying, which avoids this problem entirely. This portion of our code is otherwise the same. Turning off weight tying for us produces a new error:

10.141.246.153: Traceback (most recent call last):
10.141.246.153:   File "pretrain_gpt2.py", line 191, in <module>
10.141.246.153:     pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
10.141.246.153:   File "/home/mchorse/gpt-neox/megatron/training.py", line 109, in pretrain
10.141.246.153:     iteration = train(forward_step_func,
10.141.246.153:   File "/home/mchorse/gpt-neox/megatron/training.py", line 547, in train
10.141.246.153:     loss_dict, skipped_iter = train_step(forward_step_func,
10.141.246.153:   File "/home/mchorse/gpt-neox/megatron/training.py", line 378, in train_step
10.141.246.153:     backward_step(optimizer, model, loss)
10.141.246.153:   File "/home/mchorse/gpt-neox/megatron/training.py", line 323, in backward_step
10.141.246.153:     model.backward(loss)
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 988, in backward
10.141.246.153:     self.optimizer.backward(loss)
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 2532, in backward
10.141.246.153:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
10.141.246.153:     scaled_loss.backward(retain_graph=retain_graph)
10.141.246.153:   File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
10.141.246.153:     torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
10.141.246.153:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
10.141.246.153:     Variable._execution_engine.run_backward(
10.141.246.153:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 89, in apply
10.141.246.153:     return self._forward_cls.backward(self, *args)  # type: ignore
10.141.246.153:   File "/home/mchorse/gpt-neox/megatron/mpu/random.py", line 310, in backward
10.141.246.153:     torch.autograd.backward(outputs, args)
10.141.246.153:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
10.141.246.153:     Variable._execution_engine.run_backward(
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1398, in reduce_partition_and_remove_grads
10.141.246.153:     self.reduce_ready_partitions_and_remove_grads(param, i)
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1715, in reduce_ready_partitions_and_remove_grads
10.141.246.153:     self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1428, in reduce_independent_p_g_buckets_and_remove_grads
10.141.246.153:     self.reduce_ipg_grads()
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1685, in reduce_ipg_grads
10.141.246.153:     self.partition_previous_reduced_grads()
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1665, in partition_previous_reduced_grads
10.141.246.153:     self.async_inplace_copy_grad_to_fp32_buffer_from_gpu(
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1566, in async_inplace_copy_grad_to_fp32_buffer_from_gpu
10.141.246.153:     fp32_grad_tensor.copy_(src_tensor, non_blocking=True)
10.141.246.153: RuntimeError: The size of tensor a (171) must match the size of tensor b (169) at non-singleton dimension 0

I haven't looked into this at all, but it seems like this might be fixed by forcing weight tying on in a less ad hoc fashion.
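
For reference, a minimal sketch of what forcing weight tying means here. The module names are hypothetical; the point is that the output projection shares the embedding's parameter rather than allocating its own, so ZeRO-3 only ever partitions one copy:

```python
import torch.nn as nn

# Hedged sketch with hypothetical module names: the output projection reuses
# the embedding matrix instead of owning a separate one, so there is a single
# (vocab_size, hidden) parameter for ZeRO-3 to partition and reduce.
vocab_size, hidden = 50304, 1024
embedding = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)
lm_head.weight = embedding.weight  # tie: both modules now share one tensor
assert lm_head.weight.data_ptr() == embedding.weight.data_ptr()
```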

@StellaAthena StellaAthena linked a pull request Mar 10, 2021 that will close this issue
@StellaAthena StellaAthena linked a pull request Mar 25, 2021 that will close this issue
@StellaAthena
Member Author

We made significant changes to the codebase and MSFT made significant changes to DS, so I am effectively starting from scratch. Wheeeeeeee

@StellaAthena
Member Author

We are deprioritizing ZeRO 3 based on feedback from NVIDIA. This may be implemented in the future, but not for a while I suspect.
