
Support Lion with Zero Optimizer #1166

Merged: 5 commits merged into EleutherAI:main from zero-lion on Mar 4, 2024

Conversation

DayOfThePenguin (Contributor)

Currently, if you try to use the following optimizer and zero_optimization settings:

   "optimizer": {
     "type": "Lion",
     "params": {
       "lr": 0.0006,
       "betas": [0.9, 0.999]
     }
   },
   "zero_optimization": {
     "stage": 1,
     "allgather_partitions": true,
     "allgather_bucket_size": 500000000,
     "overlap_comm": true,
     "reduce_scatter": true,
     "reduce_bucket_size": 500000000,
     "contiguous_gradients": true
   },

You will encounter:

[2024-03-01 14:35:58,937] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = Lion
[2024-03-01 14:35:58,937] [INFO] [utils.py:54:is_zero_supported_optimizer] Checking ZeRO support for optimizer=Lion type=<class 'megatron.optimizers.Lion'>
Traceback (most recent call last):
  File "/gpt-neox/train.py", line 34, in <module>
    main()
  File "/gpt-neox/train.py", line 30, in main
    pretrain(neox_args=neox_args)
  File "/gpt-neox/megatron/training.py", line 194, in pretrain
    model, optimizer, lr_scheduler = setup_model_and_optimizer(
  File "/gpt-neox/megatron/training.py", line 667, in setup_model_and_optimizer
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/venv/lib64/python3.9/site-packages/deepspeed/__init__.py", line 180, in initialize
    engine = PipelineEngine(args=args,
  File "/venv/lib64/python3.9/site-packages/deepspeed/runtime/pipe/engine.py", line 55, in __init__
    super().__init__(*super_args, **super_kwargs)
  File "/venv/lib64/python3.9/site-packages/deepspeed/runtime/engine.py", line 309, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/venv/lib64/python3.9/site-packages/deepspeed/runtime/engine.py", line 1177, in _configure_optimizer
    optimizer_wrapper = self._do_optimizer_sanity_check(basic_optimizer)
  File "/venv/lib64/python3.9/site-packages/deepspeed/runtime/engine.py", line 1106, in _do_optimizer_sanity_check
    assert (
AssertionError: You are using an untested ZeRO Optimizer. Please add <"zero_allow_untested_optimizer": true> in the configuration file to use it.
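
As the assertion message itself suggests, a stopgap (without this PR) is to set `"zero_allow_untested_optimizer"` in the DeepSpeed config. Note that this only bypasses the support check; it does not give you a fused, ZeRO-tested Lion implementation. The placement shown here is illustrative:

```json
{
  "zero_optimization": {
    "stage": 1
  },
  "zero_allow_untested_optimizer": true
}
```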

This PR fixes that issue by selecting the correct version of Lion depending on whether ZeRO is enabled.

The DeeperSpeed version currently pinned in requirements.txt doesn't incorporate the upstream DeepSpeed changes that add FusedLion, so a DeeperSpeed version bump is also necessary. I recommend waiting to merge this until EleutherAI/DeeperSpeed#60 is merged, so the DeeperSpeed version in requirements.txt can be bumped to that merge commit. That PR plus this one have been tested for pipeline- and tensor-parallel training with FusedLion, and the original behavior (megatron.optimizers.Lion) was verified in the zero_optimization["stage"] = 0 case.
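A minimal sketch of the selection logic this PR introduces (the function name and dotted paths are illustrative, not the exact gpt-neox code): pick DeepSpeed's FusedLion when ZeRO is enabled, and keep megatron.optimizers.Lion otherwise.

```python
def select_lion_impl(zero_stage: int) -> str:
    """Illustrative only: return the dotted path of the Lion implementation
    to instantiate, based on whether ZeRO partitioning is enabled."""
    if zero_stage > 0:
        # FusedLion ships with DeepSpeed and passes its ZeRO support check.
        return "deepspeed.ops.lion.fused_lion.FusedLion"
    # With ZeRO disabled, the original megatron implementation is kept.
    return "megatron.optimizers.Lion"

print(select_lion_impl(1))  # deepspeed.ops.lion.fused_lion.FusedLion
print(select_lion_impl(0))  # megatron.optimizers.Lion
```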

zero_optimization["stage"] = 1 case

[2024-03-02 11:08:33,052] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Basic Optimizer = FusedLion
[2024-03-02 11:08:33,052] [INFO] [utils.py:56:is_zero_supported_optimizer] Checking ZeRO support for optimizer=FusedLion type=<class 'deepspeed.ops.lion.fused_lion.FusedLion'>
[2024-03-02 11:08:33,052] [INFO] [logging.py:96:log_dist] [Rank 0]

zero_optimization["stage"] = 0 case

[2024-03-02 11:19:04,115] [INFO] [logging.py:96:log_dist] [Rank 0] DeepSpeed Final Optimizer = Lion

DayOfThePenguin (Contributor, Author) commented Mar 3, 2024

@Quentin-Anthony I think this is good to go now: verified the pipeline logging error is gone and there is no regression in the zero_optimization["stage"] = 0 case.

@Quentin-Anthony Quentin-Anthony merged commit df8cf24 into EleutherAI:main Mar 4, 2024
2 of 5 checks passed
@DayOfThePenguin DayOfThePenguin deleted the zero-lion branch March 27, 2024 15:54