Lion Optimizer #1062

andylolu2 · 2023-10-20T01:21:20Z

Picking up from #1012

CLAassistant · 2023-10-20T01:21:28Z

All committers have signed the CLA.

andylolu2 · 2023-10-20T01:30:41Z

Initial results:

Configs

Lion:

{
"pipe_parallel_size": 1,
"model_parallel_size": 1,

// model settings
"num_layers": 10,
"hidden_size": 512,
"num_attention_heads": 8,
"seq_length": 512,
"max_position_embeddings": 512,
"pos_emb": "rotary",
"no_weight_tying": true,
"gpt_j_residual": false,
"output_layer_parallelism": "column",

"scaled_upper_triang_masked_softmax_fusion": false,
"bias_gelu_fusion": false,

// init methods
"init_method": "small_init",
"output_layer_init_method": "wang_init",

"optimizer": {
  "type": "Lion",
  "params": {
    "lr": 0.00033,
    "betas": [0.9, 0.95],
  }
},
"min_lr": 0.000033,

// for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
"zero_optimization": {
  "stage": 1,
  "allgather_partitions": True,
  "allgather_bucket_size": 500000000,
  "overlap_comm": True,
  "reduce_scatter": True,
  "reduce_bucket_size": 500000000,
  "contiguous_gradients": True,
},

"train_micro_batch_size_per_gpu": 2, #32,
"gas": 1,
"data_impl": "mmap",
"num_workers": 1,

// activation checkpointing
"checkpoint_activations": true,
"checkpoint_num_layers": 1,
"partition_activations": true,
"synchronize_each_layer": true,

// regularization
"gradient_clipping": 1.0,
"weight_decay": 1,
"hidden_dropout": 0,
"attention_dropout": 0,

// precision settings
"fp16": {
  "fp16": true,
  "enabled": true,
  "loss_scale": 0,
  "loss_scale_window": 1000,
  "initial_scale_power": 12,
  "hysteresis": 2,
  "min_loss_scale": 1,
},

"train_iters": 5000,
"lr_decay_iters": 5000,
"distributed_backend": "nccl",
"lr_decay_style": "cosine",
"warmup": 0.01,
"checkpoint_factor": 1000,
"eval_interval": 100000,
"eval_iters": 10,

"log_interval": 10,
"steps_per_print": 10,
"wall_clock_breakdown": true,

// additional deepspeed args not specified above
"deepspeed_extra_args": {
  "zero_allow_untested_optimizer": true,
  "comms_logger": {
      "enabled": true,
      "verbose": true,
      "prof_all": true,
      "debug": false
  },
}

}

AdamW:

{
"pipe_parallel_size": 1,
"model_parallel_size": 1,

// model settings
"num_layers": 10,
"hidden_size": 512,
"num_attention_heads": 8,
"seq_length": 512,
"max_position_embeddings": 512,
"pos_emb": "rotary",
"no_weight_tying": true,
"gpt_j_residual": false,
"output_layer_parallelism": "column",

"scaled_upper_triang_masked_softmax_fusion": false,
"bias_gelu_fusion": false,

// init methods
"init_method": "small_init",
"output_layer_init_method": "wang_init",

"optimizer": {
  "type": "Adam",
  "params": {
    "lr": 0.001,
    "betas": [0.9, 0.95],
    "eps": 1.0e-8,
  }
},
"min_lr": 0.0001,

// for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
"zero_optimization": {
  "stage": 1,
  "allgather_partitions": True,
  "allgather_bucket_size": 500000000,
  "overlap_comm": True,
  "reduce_scatter": True,
  "reduce_bucket_size": 500000000,
  "contiguous_gradients": True,
},

"train_micro_batch_size_per_gpu": 2, #32
"gas": 1,
"data_impl": "mmap",
"num_workers": 1,

// activation checkpointing
"checkpoint_activations": true,
"checkpoint_num_layers": 1,
"partition_activations": true,
"synchronize_each_layer": true,

// regularization
"gradient_clipping": 1.0,
"weight_decay": 0.1,
"hidden_dropout": 0,
"attention_dropout": 0,

// precision settings
"fp16": {
  "fp16": true,
  "enabled": true,
  "loss_scale": 0,
  "loss_scale_window": 1000,
  "initial_scale_power": 12,
  "hysteresis": 2,
  "min_loss_scale": 1,
},

"train_iters": 5000,
"lr_decay_iters": 5000,
"distributed_backend": "nccl",
"lr_decay_style": "cosine",
"warmup": 0.01,
"checkpoint_factor": 1000,
"eval_interval": 100000,
"eval_iters": 10,

"log_interval": 10,
"steps_per_print": 10,
"wall_clock_breakdown": true,

// additional deepspeed args not specified above
"deepspeed_extra_args": {
  "comms_logger": {
      "enabled": true,
      "verbose": true,
      "prof_all": true,
      "debug": false
  },
}

}
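Apart from the optimizer block, the two configs differ only in hyperparameters: the Lion run uses a roughly 3x smaller learning rate (0.00033 vs 0.001) and a 10x larger weight decay (1 vs 0.1). For readers unfamiliar with the optimizer, here is a minimal sketch of the Lion update rule as it is usually described, written in plain Python over lists of floats. It is an illustration only and is not the implementation added in this PR; the function name `lion_step` and its signature are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=3.3e-4, betas=(0.9, 0.95), weight_decay=1.0):
    """One Lion step over flat lists of floats (hypothetical helper, for illustration).

    Returns the updated parameters and the updated momentum buffer.
    """
    beta1, beta2 = betas
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # The update direction is the sign of an interpolation between
        # the momentum and the current gradient, so every coordinate
        # moves by exactly lr (before weight decay).
        update = sign(beta1 * m + (1 - beta1) * g)
        # Decoupled weight decay is applied together with the signed update.
        new_params.append(p - lr * (update + weight_decay * p))
        # The momentum is tracked with a separate coefficient, beta2.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


params, momentum = lion_step(
    params=[1.0, -2.0],
    grads=[0.5, -0.1],
    momentum=[0.0, 0.0],
    lr=0.1,
    weight_decay=0.0,
)
```

Because each coordinate of the update has magnitude 1, the effective step is larger than AdamW's at the same learning rate, which is the usual motivation for pairing Lion with a smaller learning rate and a larger weight decay, as the configs above do.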

andylolu2 · 2023-10-20T01:38:09Z

The memory improvement is quite small, which might be expected since activations dominate memory usage anyway.
No obvious improvement in terms of convergence, but these are only tiny experiments.
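A back-of-the-envelope estimate shows why the saving is small at this scale. Lion keeps one state buffer per parameter (momentum), while Adam keeps two (first and second moments). The sketch below rests on assumptions not stated in the thread: the common `12 * num_layers * hidden_size^2` approximation for transformer block weights (4x MLP expansion, no biases or layer norms), fp32 optimizer states, and no embedding parameters, since the vocabulary size does not appear in the configs. The helper names are hypothetical.

```python
def transformer_block_params(num_layers, hidden_size):
    # Approximate attention + MLP weight count, assuming a 4x MLP expansion
    # and ignoring biases, layer norms, and embeddings.
    return 12 * num_layers * hidden_size ** 2


def optimizer_state_bytes(n_params, buffers_per_param, bytes_per_value=4):
    # Assumes each optimizer state buffer is stored in fp32 (4 bytes per value).
    return n_params * buffers_per_param * bytes_per_value


# Values taken from the configs above: num_layers=10, hidden_size=512.
n_params = transformer_block_params(num_layers=10, hidden_size=512)
adam_bytes = optimizer_state_bytes(n_params, buffers_per_param=2)  # exp_avg, exp_avg_sq
lion_bytes = optimizer_state_bytes(n_params, buffers_per_param=1)  # momentum only
saved_mib = (adam_bytes - lion_bytes) / 2 ** 20

print(f"block params: {n_params:,}")
print(f"estimated optimizer-state saving: {saved_mib:.0f} MiB")
```

Under these assumptions the saving is about 120 MiB in total, and with ZeRO stage 1 the optimizer states are partitioned across data-parallel ranks, so the per-GPU figure would be smaller still.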

I would need access to more compute to run larger experiments, if those are needed.

Quentin-Anthony · 2023-10-20T02:19:47Z

I think these results demonstrate correctness. Performance can be ironed out later on the target system if/when we use this to train a real model.

Quentin-Anthony · 2023-10-20T02:20:08Z

Thank you @kamathis4 and @andylolu2!

adi-kmt · 2023-10-20T04:38:04Z

Thanks @andylolu2 for picking this up!

adi-kmt and others added 3 commits August 13, 2023 11:39

initial commit

2131b81

test set, fixed readme and docstring

7afa3ee

Refactor Lion implementation

f3a2dd6

andylolu2 requested a review from a team as a code owner October 20, 2023 01:21

andylolu2 requested review from Quentin-Anthony and ShivanshuPurohit October 20, 2023 01:21

andylolu2 mentioned this pull request Oct 20, 2023

Lion Optimizer #1012

Closed

Quentin-Anthony approved these changes Oct 20, 2023

Quentin-Anthony merged commit b02d989 into EleutherAI:main Oct 20, 2023
2 checks passed
