Take ZeRO 3 for a test drive #171

Closed
StellaAthena opened this issue Mar 8, 2021 · 6 comments · Fixed by #172

@StellaAthena
Member

Is your feature request related to a problem? Please describe.
Model too smol

Describe the solution you'd like
DeepSpeed ZeRO-3 is finally public! Let’s take it for a test drive (remember to turn pipeline parallelism off) and see if we can get it to run.

Describe alternatives you've considered
Use Pipeline Parallelism

Additional context
It probably doesn’t work or needs to be modded to work because DeepSpeed :works-internally:
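
For reference, a minimal sketch of the kind of DeepSpeed config this calls for. The values below are illustrative assumptions, not our actual settings, and pipeline parallelism would be disabled separately on the NeoX side:

```python
# Hedged sketch of a DeepSpeed config dict enabling ZeRO stage 3; the
# equivalent JSON goes in the DeepSpeed config file. Batch size and fp16
# settings are placeholder assumptions.
ds_config = {
    "train_batch_size": 256,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
    },
}
```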

@StellaAthena StellaAthena added feature request New feature or request experiments Experiments we wish to perform on the codebase labels Mar 8, 2021
@StellaAthena StellaAthena self-assigned this Mar 8, 2021
@StellaAthena StellaAthena added this to To do in 1T or BUST via automation Mar 8, 2021
@StellaAthena StellaAthena moved this from To do to In progress in 1T or BUST Mar 9, 2021
@StellaAthena
Member Author

Code is implemented, running into some compatibility issues due to our mods. Shouldn’t be hard to resolve.

@StellaAthena
Member Author

Note: Do not use scientific notation in the config file. The config monster parses it as a string instead of a number (cc: @joshlk)
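
A minimal workaround sketch, assuming the parsed config ends up as a nested dict with the offending values stored as strings (this is not our actual config loader, just an illustration):

```python
# Hedged sketch: walk the parsed config and coerce strings that look like
# numbers (e.g. "1e-5") back into floats, so a string-typed learning rate
# never reaches the optimizer.
def coerce_numeric_strings(cfg):
    if isinstance(cfg, dict):
        return {k: coerce_numeric_strings(v) for k, v in cfg.items()}
    if isinstance(cfg, str):
        try:
            return float(cfg)
        except ValueError:
            return cfg
    return cfg
```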

Current error is

10.141.246.144: Traceback (most recent call last):
10.141.246.144:   File "pretrain_gpt2.py", line 193, in <module>
10.141.246.144:     pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
10.141.246.144:   File "/home/mchorse/gpt-neox/megatron/training.py", line 92, in pretrain
10.141.246.144:     model, optimizer, lr_scheduler = setup_model_and_optimizer(model_provider)
10.141.246.144:   File "/home/mchorse/gpt-neox/megatron/training.py", line 282, in setup_model_and_optimizer
10.141.246.144:     model, optimizer, _, lr_scheduler = deepspeed.initialize(
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/__init__.py", line 112, in initialize
10.141.246.144:     engine = DeepSpeedEngine(args=args,
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 183, in __init__
10.141.246.144:     self._configure_optimizer(optimizer, model_parameters)
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 612, in _configure_optimizer
10.141.246.144:     self.optimizer = self._configure_zero_optimizer(basic_optimizer)
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 762, in _configure_zero_optimizer
10.141.246.144:     optimizer = FP16_DeepSpeedZeroOptimizer_Stage3(
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 726, in __init__
10.141.246.144:     self.initialize_optimizer_states()
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1221, in initialize_optimizer_states
10.141.246.144:     self._optimizer_step(i)
10.141.246.144:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1186, in _optimizer_step
10.141.246.144:     self.optimizer.step()
10.141.246.144:   File "/usr/local/lib/python3.8/dist-packages/torch/optim/optimizer.py", line 89, in wrapper
10.141.246.144:     return func(*args, **kwargs)
10.141.246.144:   File "/usr/local/lib/python3.8/dist-packages/apex/optimizers/fused_adam.py", line 159, in step
10.141.246.144:     multi_tensor_applier(self.multi_tensor_adam,
10.141.246.144:   File "/usr/local/lib/python3.8/dist-packages/apex/multi_tensor_apply/multi_tensor_apply.py", line 27, in __call__
10.141.246.144:     return op(self.chunk_size,
10.141.246.144: RuntimeError: expected input to be on cuda

I thought this was connected to the line with `deepspeed.zero.Init(data_parallel_group=mpu.get_data_parallel_group(), remote_device="cpu"):` in pretrain_gpt2.py, but taking that out doesn't solve the issue.
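
For context, this is roughly how that context manager is meant to wrap model construction. A minimal sketch, assuming torch.distributed and the Megatron mpu are already initialized; the small nn.Linear stands in for the real model provider:

```python
import torch
import deepspeed
from megatron import mpu

# Sketch mirroring the call quoted above: deepspeed.zero.Init partitions
# parameters as they are constructed and keeps them on the remote device
# ("cpu" here). The nn.Linear is a stand-in for the real GPT-2 model builder.
with deepspeed.zero.Init(data_parallel_group=mpu.get_data_parallel_group(),
                         remote_device="cpu"):
    model = torch.nn.Linear(1024, 1024)
```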

@StellaAthena
Member Author

Updates from further testing indicate that this is a Pipeline Parallelism problem:

  • Setting PP = 0, ZeRO = 3 gives RuntimeError: expected input to be on cuda
  • Setting PP = 0, ZeRO = 1 gives AttributeError: 'tuple' object has no attribute 'float'
  • Setting PP = 1 runs, regardless of ZeRO

@StellaAthena
Member Author

Okay, so the problem with turning PP off is that this line returns (tensor, None). It does this regardless of the ZeRO settings. In ZeRO 1, if I add [0] to the end, it trains but cannot log. In ZeRO 3, it breaks with a complicated error signature.
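
A quick illustration of why the bare tuple breaks both paths, and what the [0] hack amounts to (a sketch, not the actual training loop):

```python
import torch

# With PP off, the forward pass hands back (tensor, None) instead of a bare
# tensor, so DeepSpeed's loss scaler ends up calling .float() on a tuple:
loss = (torch.tensor(1.0, requires_grad=True), None)
# loss.float()  # AttributeError: 'tuple' object has no attribute 'float'

# The "[0]" workaround is just defensive unpacking before model.backward(loss):
if isinstance(loss, tuple):
    loss = loss[0]
```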

Comparing to the MSFT repository, I see that they just always use weight tying, which avoids this problem entirely. This portion of our code is otherwise the same. Turning off weight tying for us produces a new error:

10.141.246.153: Traceback (most recent call last):
10.141.246.153:   File "pretrain_gpt2.py", line 191, in <module>
10.141.246.153:     pretrain(train_valid_test_datasets_provider, model_provider, forward_step,
10.141.246.153:   File "/home/mchorse/gpt-neox/megatron/training.py", line 109, in pretrain
10.141.246.153:     iteration = train(forward_step_func,
10.141.246.153:   File "/home/mchorse/gpt-neox/megatron/training.py", line 547, in train
10.141.246.153:     loss_dict, skipped_iter = train_step(forward_step_func,
10.141.246.153:   File "/home/mchorse/gpt-neox/megatron/training.py", line 378, in train_step
10.141.246.153:     backward_step(optimizer, model, loss)
10.141.246.153:   File "/home/mchorse/gpt-neox/megatron/training.py", line 323, in backward_step
10.141.246.153:     model.backward(loss)
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/engine.py", line 988, in backward
10.141.246.153:     self.optimizer.backward(loss)
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 2532, in backward
10.141.246.153:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/fp16/loss_scaler.py", line 53, in backward
10.141.246.153:     scaled_loss.backward(retain_graph=retain_graph)
10.141.246.153:   File "/usr/local/lib/python3.8/dist-packages/torch/tensor.py", line 245, in backward
10.141.246.153:     torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
10.141.246.153:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
10.141.246.153:     Variable._execution_engine.run_backward(
10.141.246.153:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 89, in apply
10.141.246.153:     return self._forward_cls.backward(self, *args)  # type: ignore
10.141.246.153:   File "/home/mchorse/gpt-neox/megatron/mpu/random.py", line 310, in backward
10.141.246.153:     torch.autograd.backward(outputs, args)
10.141.246.153:   File "/usr/local/lib/python3.8/dist-packages/torch/autograd/__init__.py", line 145, in backward
10.141.246.153:     Variable._execution_engine.run_backward(
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1398, in reduce_partition_and_remove_grads
10.141.246.153:     self.reduce_ready_partitions_and_remove_grads(param, i)
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1715, in reduce_ready_partitions_and_remove_grads
10.141.246.153:     self.reduce_independent_p_g_buckets_and_remove_grads(param, i)
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1428, in reduce_independent_p_g_buckets_and_remove_grads
10.141.246.153:     self.reduce_ipg_grads()
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1685, in reduce_ipg_grads
10.141.246.153:     self.partition_previous_reduced_grads()
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1665, in partition_previous_reduced_grads
10.141.246.153:     self.async_inplace_copy_grad_to_fp32_buffer_from_gpu(
10.141.246.153:   File "/home/mchorse/gpt-neox/src/deepspeed/deepspeed/runtime/zero/stage3.py", line 1566, in async_inplace_copy_grad_to_fp32_buffer_from_gpu
10.141.246.153:     fp32_grad_tensor.copy_(src_tensor, non_blocking=True)
10.141.246.153: RuntimeError: The size of tensor a (171) must match the size of tensor b (169) at non-singleton dimension 0

I haven't looked into this at all, but it seems like this might be fixed by forcing weight tying on in a less ad hoc fashion.
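
For reference, a minimal sketch of what forcing weight tying means here. The module names are hypothetical; the point is that the output projection shares the embedding's parameter rather than allocating its own, so ZeRO-3 only ever partitions one copy:

```python
import torch.nn as nn

# Hedged sketch with hypothetical module names: the output projection reuses
# the embedding matrix instead of owning a separate one, so there is a single
# (vocab_size, hidden) parameter for ZeRO-3 to partition and reduce.
vocab_size, hidden = 50304, 1024
embedding = nn.Embedding(vocab_size, hidden)
lm_head = nn.Linear(hidden, vocab_size, bias=False)
lm_head.weight = embedding.weight  # tie: both modules now share one tensor
assert lm_head.weight.data_ptr() == embedding.weight.data_ptr()
```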

@StellaAthena StellaAthena linked a pull request Mar 10, 2021 that will close this issue
@StellaAthena StellaAthena linked a pull request Mar 25, 2021 that will close this issue
@StellaAthena
Member Author

We made significant changes to the codebase and MSFT made significant changes to DS, so I am effectively starting from scratch. Wheeeeeeee

@StellaAthena
Member Author

We are deprioritizing ZeRO 3 based on feedback from NVIDIA. This may be implemented in the future, but not for a while I suspect.
