
Lion Optimizer #1062

Merged · 3 commits into EleutherAI:main · Oct 20, 2023

Conversation

andylolu2
Contributor

Picking up from #1012

@CLAassistant

CLAassistant commented Oct 20, 2023

CLA assistant check
All committers have signed the CLA.

@andylolu2
Contributor Author

Initial results:

Configs

Lion:

{
"pipe_parallel_size": 1,
"model_parallel_size": 1,

// model settings
"num_layers": 10,
"hidden_size": 512,
"num_attention_heads": 8,
"seq_length": 512,
"max_position_embeddings": 512,
"pos_emb": "rotary",
"no_weight_tying": true,
"gpt_j_residual": false,
"output_layer_parallelism": "column",

"scaled_upper_triang_masked_softmax_fusion": false,
"bias_gelu_fusion": false,

// init methods
"init_method": "small_init",
"output_layer_init_method": "wang_init",

"optimizer": {
  "type": "Lion",
  "params": {
    "lr": 0.00033,
    "betas": [0.9, 0.95],
  }
},
"min_lr": 0.000033,

// for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
"zero_optimization": {
  "stage": 1,
  "allgather_partitions": True,
  "allgather_bucket_size": 500000000,
  "overlap_comm": True,
  "reduce_scatter": True,
  "reduce_bucket_size": 500000000,
  "contiguous_gradients": True,
},

"train_micro_batch_size_per_gpu": 2, #32,
"gas": 1,
"data_impl": "mmap",
"num_workers": 1,

// activation checkpointing
"checkpoint_activations": true,
"checkpoint_num_layers": 1,
"partition_activations": true,
"synchronize_each_layer": true,

// regularization
"gradient_clipping": 1.0,
"weight_decay": 1,
"hidden_dropout": 0,
"attention_dropout": 0,

// precision settings
"fp16": {
  "fp16": true,
  "enabled": true,
  "loss_scale": 0,
  "loss_scale_window": 1000,
  "initial_scale_power": 12,
  "hysteresis": 2,
  "min_loss_scale": 1,
},

"train_iters": 5000,
"lr_decay_iters": 5000,
"distributed_backend": "nccl",
"lr_decay_style": "cosine",
"warmup": 0.01,
"checkpoint_factor": 1000,
"eval_interval": 100000,
"eval_iters": 10,

"log_interval": 10,
"steps_per_print": 10,
"wall_clock_breakdown": true,

// additional deepspeed args not specified above
"deepspeed_extra_args": {
  "zero_allow_untested_optimizer": true,
  "comms_logger": {
      "enabled": true,
      "verbose": true,
      "prof_all": true,
      "debug": false
  },
}

}

AdamW:

{
"pipe_parallel_size": 1,
"model_parallel_size": 1,

// model settings
"num_layers": 10,
"hidden_size": 512,
"num_attention_heads": 8,
"seq_length": 512,
"max_position_embeddings": 512,
"pos_emb": "rotary",
"no_weight_tying": true,
"gpt_j_residual": false,
"output_layer_parallelism": "column",

"scaled_upper_triang_masked_softmax_fusion": false,
"bias_gelu_fusion": false,

// init methods
"init_method": "small_init",
"output_layer_init_method": "wang_init",

"optimizer": {
  "type": "Adam",
  "params": {
    "lr": 0.001,
    "betas": [0.9, 0.95],
    "eps": 1.0e-8,
  }
},
"min_lr": 0.0001,

// for all zero_optimization options, see https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
"zero_optimization": {
  "stage": 1,
  "allgather_partitions": True,
  "allgather_bucket_size": 500000000,
  "overlap_comm": True,
  "reduce_scatter": True,
  "reduce_bucket_size": 500000000,
  "contiguous_gradients": True,
},

"train_micro_batch_size_per_gpu": 2, #32
"gas": 1,
"data_impl": "mmap",
"num_workers": 1,

// activation checkpointing
"checkpoint_activations": true,
"checkpoint_num_layers": 1,
"partition_activations": true,
"synchronize_each_layer": true,

// regularization
"gradient_clipping": 1.0,
"weight_decay": 0.1,
"hidden_dropout": 0,
"attention_dropout": 0,

// precision settings
"fp16": {
  "fp16": true,
  "enabled": true,
  "loss_scale": 0,
  "loss_scale_window": 1000,
  "initial_scale_power": 12,
  "hysteresis": 2,
  "min_loss_scale": 1,
},

"train_iters": 5000,
"lr_decay_iters": 5000,
"distributed_backend": "nccl",
"lr_decay_style": "cosine",
"warmup": 0.01,
"checkpoint_factor": 1000,
"eval_interval": 100000,
"eval_iters": 10,

"log_interval": 10,
"steps_per_print": 10,
"wall_clock_breakdown": true,

// additional deepspeed args not specified above
"deepspeed_extra_args": {
  "comms_logger": {
      "enabled": true,
      "verbose": true,
      "prof_all": true,
      "debug": false
  },
}

}

[image: initial results plot]
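
For reference, the two optimizer blocks differ in the direction the Lion paper (Chen et al., 2023) recommends: a roughly 3x smaller learning rate than AdamW paired with a larger decoupled weight decay. Below is a minimal sketch of the update rule those Lion parameters (lr, betas, weight_decay) feed into, assuming a generic PyTorch-style implementation rather than the exact code merged in this PR:

import torch

@torch.no_grad()
def lion_step(param, grad, momentum, lr=3.3e-4, betas=(0.9, 0.95), weight_decay=1.0):
    # Defaults mirror the Lion config above; a real implementation would live
    # inside a torch.optim.Optimizer subclass and loop over parameter groups.
    beta1, beta2 = betas
    # Decoupled weight decay applied to the current weights, as in AdamW.
    param.mul_(1 - lr * weight_decay)
    # The update direction is only the sign of an interpolation of momentum and gradient.
    update = (beta1 * momentum + (1 - beta1) * grad).sign_()
    param.add_(update, alpha=-lr)
    # The single momentum buffer is updated with the second beta.
    momentum.mul_(beta2).add_(grad, alpha=1 - beta2)

Because sign() makes every coordinate of the step ±lr, the paper suggests compensating a smaller lr with a larger weight_decay so that their product stays in a similar range, which is the direction the two configs above take (0.00033 × 1.0 vs 0.001 × 0.1).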

@andylolu2 mentioned this pull request Oct 20, 2023
@andylolu2
Contributor Author

andylolu2 commented Oct 20, 2023

  • The memory improvement is quite small, which is probably expected since activations dominate memory usage at this scale (see the back-of-envelope sketch below).
  • No obvious improvement in convergence, but these are only tiny experiments.

If larger experiments are needed, I'll need some compute resources to run them.
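
A rough back-of-envelope (my own estimate, not measured from these runs) for why the optimizer-state saving is small at this scale: Lion keeps a single momentum buffer per parameter, whereas Adam keeps two (exp_avg and exp_avg_sq), but a 10-layer, hidden-size-512 model simply does not have many weights relative to its activation footprint at seq_length 512.

# Assumes fp32 optimizer state; ignores embedding/vocab parameters for brevity.
num_layers, hidden = 10, 512
transformer_params = 12 * num_layers * hidden ** 2   # common ~12*L*h^2 estimate

adam_state_bytes = 2 * 4 * transformer_params   # two fp32 buffers per parameter
lion_state_bytes = 1 * 4 * transformer_params   # one fp32 buffer per parameter
print(f"Adam state ~{adam_state_bytes / 2**20:.0f} MiB, Lion state ~{lion_state_bytes / 2**20:.0f} MiB")

That is a saving on the order of 100 MiB per replica before ZeRO stage-1 partitioning, and fp16 training typically keeps fp32 master weights under either optimizer, so the relative drop in total memory is modest.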

@Quentin-Anthony
Member

I think these results demonstrate correctness. Performance can be ironed out later on the target system if/when we use this to train a real model.

@Quentin-Anthony
Member

Thank you @kamathis4 and @andylolu2!

@Quentin-Anthony merged commit b02d989 into EleutherAI:main Oct 20, 2023
2 checks passed
@adi-kmt
Contributor

adi-kmt commented Oct 20, 2023

Thanks @andylolu2 for taking this over!!
