AGD Optimizer

An Auto-switchable Optimizer using Stepwise Gradient Difference as Preconditioning Matrix

We present the PyTorch code for AGD: An Auto-switchable Optimizer using Stepwise Gradient Difference as Preconditioning Matrix, published at NeurIPS 2023.

AGD uses the gradient difference between the current and previous steps to form the preconditioning matrix, and it can dynamically switch between adaptive and stochastic forms through an automated switching mechanism. Thanks to this dual behavior, AGD achieves faster convergence and better generalization than state-of-the-art optimizers.
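To make this concrete, below is a simplified per-parameter sketch of the switching idea only; it is not the exact update from the paper or the atorch implementation, and it omits bias correction, weight decay, and gradient clipping. The coefficients beta1 and beta2 and all tensor names are illustrative assumptions.

```python
import torch

def agd_style_step(param, grad, prev_grad, exp_avg, diff_sq_avg,
                   lr=1e-4, beta1=0.9, beta2=0.999, delta=1e-14):
    """Illustrative sketch of the auto-switching idea (not the exact AGD update)."""
    # Momentum on the raw gradient.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)

    # Second-moment-style statistic built from the *stepwise gradient
    # difference* rather than from the gradient itself.
    diff = grad - prev_grad
    diff_sq_avg.mul_(beta2).addcmul_(diff, diff, value=1 - beta2)

    # Auto-switching: where sqrt(diff_sq_avg) < delta, the denominator becomes
    # the constant delta and the step behaves like SGD with momentum;
    # elsewhere the step is adaptive (Adam-like).
    denom = diff_sq_avg.sqrt().clamp_min(delta)
    param.addcdiv_(exp_avg, denom, value=-lr)
```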

Toy example

Usage

AGD can be used as a drop-in replacement for AdamW:

from atorch.optimizers.agd import AGD
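
A minimal training-loop sketch, assuming the constructor takes `params` and `lr` in the same way as `torch.optim.AdamW`; any additional keyword arguments should be checked against the atorch implementation:

```python
import torch
from atorch.optimizers.agd import AGD

model = torch.nn.Linear(128, 10)
# lr is typically set to about 1/10 of the value you would use with AdamW.
optimizer = AGD(model.parameters(), lr=1e-4)

data = torch.randn(32, 128)
target = torch.randint(0, 10, (32,))
for _ in range(10):
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(data), target)
    loss.backward()
    optimizer.step()
```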

Hyperparameters

  • lr: Empirically, set it to about 1/10 of the value you would use with AdamW.
  • delta: Refer to the settings in the paper; for Transformer-like models, the default value of 1e-14 usually works well.
  • clip: Usually does not need to be set, but if you encounter training instability, try clip=5.
  • Others: Set them according to the usual empirical guidelines (a configuration sketch follows below).
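
Putting these recommendations together, a hedged configuration sketch; the keyword names `delta` and `clip` are assumed to match the `AGD` constructor, and the AdamW baseline learning rate is only an example value:

```python
import torch
from atorch.optimizers.agd import AGD

model = torch.nn.Linear(128, 10)  # stand-in model for illustration
adamw_lr = 3e-4                   # the lr you would have used with AdamW (example value)

optimizer = AGD(
    model.parameters(),
    lr=adamw_lr / 10,             # empirically ~1/10 of the AdamW learning rate
    delta=1e-14,                  # default is usually fine for Transformer-like models
    # clip=5,                     # uncomment if you hit training instability
)
```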

AGD's performance on nanoGPT

Given the popularity of large-scale models, we also tested the effectiveness of AGD on nanoGPT. As expected, AGD converges very quickly, providing up to a 1.5x speedup over AdamW, which can significantly reduce training time and cost.