
[ICML 2023] Decentralized SGD and Average-direction SAM are Asymptotically Equivalent

License: MIT · arXiv · blog · twitter · slides · poster

The Best of All Worlds: Embracing Decentralization for Improved Communication Efficiency, Privacy, and Generalization

This repository contains the official implementation of the paper

[ICML 2023] Decentralized SGD and Average-direction SAM are Asymptotically Equivalent


Overview

Motivating question: The Best of All Worlds? Can we guarantee communication efficiency, privacy, and generalizability all at once? Our recent ICML 2023 paper proves that decentralized training might be the answer!

TLDR: This is the first work on the surprising sharpness-aware-minimization nature of decentralized learning. We provide a completely new perspective for understanding decentralization, which helps bridge the gap between theory and practice in decentralized learning.

Abstract: Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction Sharpness-aware minimization (SAM) algorithm under general non-convex non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.
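To make the claimed equivalence concrete, the sketch below (a minimal NumPy illustration, not the repository's implementation) contrasts one synchronous D-SGD round with the average-direction SAM surrogate gradient it asymptotically follows. The interfaces `local_grad_fn`, `grad_fn`, the mixing matrix `W`, and the learning rate `lr` are illustrative assumptions.

```python
import numpy as np

def dsgd_round(params, local_grad_fn, W, lr):
    """One synchronous D-SGD round (illustrative): each worker i takes a
    local stochastic-gradient step, then gossip-averages its parameters
    with its neighbours via a doubly stochastic mixing matrix W."""
    n = len(params)
    stepped = [params[i] - lr * local_grad_fn(i, params[i]) for i in range(n)]
    return [sum(W[i, j] * stepped[j] for j in range(n)) for i in range(n)]

def avg_direction_sam_grad(x_bar, deviations, grad_fn):
    """Average-direction SAM surrogate (illustrative): the gradient of
    E_xi[f(x_bar + xi)], approximated by averaging gradients evaluated at
    the average model x_bar perturbed along the local deviations
    xi_i = x_i - x_bar."""
    return np.mean([grad_fn(x_bar + xi) for xi in deviations], axis=0)
```

Roughly speaking, the paper shows that the averaged iterate of the first update asymptotically follows the second gradient, i.e. D-SGD implicitly minimizes the perturbed objective E_xi[f(x + xi)] rather than f(x) itself, which is where the sharpness regularization effect comes from.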


Environment Setup

The required packages can be installed directly from requirements.txt:

pip install -r requirements.txt

Example usage

Train ResNet-18 on CIFAR-10 with C-SGD and D-SGD (ring topology) at a total batch size of 1024 (16 workers × 64 per-worker batch size):

python main.py --dataset_name "CIFAR10" --image_size 56 --batch_size 64 --mode "csgd" --size 16 --lr 0.1 --model "ResNet18_M" --warmup_step 60 --milestones 2400 4800 --early_stop 6000 --epoch 6000 --seed 666 --pretrained 1 --device 0

python main.py --dataset_name "CIFAR10" --image_size 56 --batch_size 64 --mode "ring" --size 16 --lr 0.1 --model "ResNet18_M" --warmup_step 60 --milestones 2400 4800 --early_stop 6000 --epoch 6000 --seed 666 --pretrained 1 --device 0

Train ResNet-18 on CIFAR-10 with C-SGD and D-SGD (ring topology) at a total batch size of 8192 (16 workers × 512 per-worker batch size):

python main.py --dataset_name "CIFAR10" --image_size 56 --batch_size 512 --mode "csgd" --size 16 --lr 0.8 --model "ResNet18_M" --warmup_step 60 --milestones 2400 4800 --early_stop 6000 --epoch 6000 --seed 666 --pretrained 1 --device 0

python main.py --dataset_name "CIFAR10" --image_size 56 --batch_size 512 --mode "ring" --size 16 --lr 0.8 --model "ResNet18_M" --warmup_step 60 --milestones 2400 4800 --early_stop 6000 --epoch 6000 --seed 666 --pretrained 1 --device 0

More detailed scripts can be found in the "scripts" folder.
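As a point of reference for the `--mode "ring" --size 16` runs above: a ring topology over 16 workers is commonly encoded by a doubly stochastic mixing matrix in which every worker averages equally with itself and its two neighbours. The sketch below is an illustrative assumption about this setup, not the repository's own topology code.

```python
import numpy as np

def ring_mixing_matrix(n: int) -> np.ndarray:
    """Doubly stochastic mixing matrix for a ring of n >= 3 workers:
    each worker keeps weight 1/3 for itself and assigns 1/3 to each of
    its two ring neighbours (illustrative construction)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] = 1.0 / 3.0
        W[i, (i + 1) % n] = 1.0 / 3.0
    return W

# e.g. ring_mixing_matrix(16) for the 16-worker runs above
```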

The 3D local loss landscape visualization is based on the linked visualization tool.


Citing this repository

Please cite our paper if you find this repo useful in your work:


@InProceedings{pmlr-v202-zhu23e,
  title     = {Decentralized {SGD} and Average-direction {SAM} are Asymptotically Equivalent},
  author    = {Zhu, Tongtian and He, Fengxiang and Chen, Kaixuan and Song, Mingli and Tao, Dacheng},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  pages     = {43005--43036},
  year      = {2023},
  editor    = {Krause, Andreas and Brunskill, Emma and Cho, Kyunghyun and Engelhardt, Barbara and Sabato, Sivan and Scarlett, Jonathan},
  volume    = {202},
  series    = {Proceedings of Machine Learning Research},
  month     = {23--29 Jul},
  publisher = {PMLR},
  pdf       = {https://proceedings.mlr.press/v202/zhu23e/zhu23e.pdf},
  url       = {https://proceedings.mlr.press/v202/zhu23e.html},
  abstract  = {Decentralized stochastic gradient descent (D-SGD) allows collaborative learning on massive devices simultaneously without the control of a central server. However, existing theories claim that decentralization invariably undermines generalization. In this paper, we challenge the conventional belief and present a completely new perspective for understanding decentralized learning. We prove that D-SGD implicitly minimizes the loss function of an average-direction Sharpness-aware minimization (SAM) algorithm under general non-convex non-$\beta$-smooth settings. This surprising asymptotic equivalence reveals an intrinsic regularization-optimization trade-off and three advantages of decentralization: (1) there exists a free uncertainty evaluation mechanism in D-SGD to improve posterior estimation; (2) D-SGD exhibits a gradient smoothing effect; and (3) the sharpness regularization effect of D-SGD does not decrease as total batch size increases, which justifies the potential generalization benefit of D-SGD over centralized SGD (C-SGD) in large-batch scenarios.}
}

Contact

Please feel free to contact us via email ([email protected]) or WeChat (RaidenT_T) if you have any questions.