SGD foreach with momentum x PT2 regression #1665

Open
github-actions bot opened this issue May 19, 2023 · 5 comments

Comments

@github-actions

TorchBench CI has detected a performance signal or runtime regression.

Base PyTorch commit: 174d01bc939c7bdf390113c75d5ec2ce84cfa1d2

Affected PyTorch commit: 329bb2a33e40f4bc76b2e061b180d3234984c91b

Affected Tests:

  • hf_GPT2, Adagrad, cuda, (pt2) default: -10.00795%
  • hf_GPT2, Adagrad, cuda, default: -20.60405%
  • hf_GPT2, Adagrad, cuda, foreach: -13.93744%
  • hf_GPT2, ASGD, cuda, (pt2) default: +12.04677%
  • hf_GPT2, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +14.10974%
  • hf_GPT2, SGD, cuda, (pt2) foreach, momentum=0.9: +13.43381%
  • squeezenet1_1, Adagrad, cuda, default: -13.30356%
  • squeezenet1_1, Adagrad, cuda, (pt2) foreach: -10.77666%
  • squeezenet1_1, AdamW, cuda, (pt2) foreach, maximize, capturable, amsgrad: +19.58979%
  • squeezenet1_1, AdamW, cuda, (pt2) fused, capturable, amsgrad: -12.95181%
  • squeezenet1_1, ASGD, cuda, maximize: -10.21890%
  • squeezenet1_1, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +23.30649%
  • squeezenet1_1, SGD, cuda, (pt2) foreach, momentum=0.9: +30.52701%
  • vgg16, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +19.98093%
  • vgg16, SGD, cuda, (pt2) foreach, momentum=0.9: +13.81549%
  • alexnet, Adadelta, cuda, (pt2) default: +12.28879%
  • alexnet, Adagrad, cuda, (pt2) maximize: +10.50578%
  • alexnet, Adagrad, cuda, (pt2) foreach: -13.71834%
  • detectron2_fasterrcnn_r_101_dc5, Adadelta, cuda, (pt2) maximize: +15.09218%
  • detectron2_fasterrcnn_r_101_dc5, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +19.41707%
  • detectron2_fasterrcnn_r_101_dc5, SGD, cuda, (pt2) foreach, momentum=0.9: +16.36527%
  • maml_omniglot, Adam, cuda, (pt2) foreach, maximize, capturable: +10.32854%
  • maml_omniglot, Adam, cuda, foreach, maximize, capturable: -12.44876%
  • maml_omniglot, Adam, cuda, fused: +10.73952%
  • maml_omniglot, Adam, cuda, (pt2) fused, capturable, amsgrad: +11.43319%
  • mobilenet_v2, Adam, cuda, (pt2) foreach: -14.93973%
  • mobilenet_v2, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +15.47728%
  • mobilenet_v2, SGD, cuda, (pt2) foreach, momentum=0.9: +20.00669%
  • mobilenet_v2, Rprop, cuda, default: -13.13734%
  • vision_maskrcnn, Adagrad, cuda, (pt2) maximize: +28.06755%
  • vision_maskrcnn, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +23.58648%
  • vision_maskrcnn, SGD, cuda, (pt2) foreach, momentum=0.9: +19.58112%
  • detectron2_maskrcnn_r_101_fpn, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +20.94936%
  • detectron2_maskrcnn_r_101_fpn, SGD, cuda, (pt2) foreach, momentum=0.9: +16.55558%
  • hf_Reformer, Adam, cuda, (pt2) amsgrad, maximize: +11.78429%
  • hf_Reformer, AdamW, cuda, (pt2) default: -10.03352%
  • hf_Reformer, ASGD, cuda, (pt2) default: -13.96709%
  • hf_Reformer, ASGD, cuda, foreach: -11.98059%
  • hf_Reformer, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +287.19047%
  • hf_Reformer, SGD, cuda, (pt2) foreach, momentum=0.9: +226.08197%
  • hf_DistilBert, Adagrad, cuda, (pt2) default: -16.88534%
  • hf_DistilBert, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +16.40926%
  • hf_DistilBert, SGD, cuda, (pt2) foreach, momentum=0.9: +12.85974%
  • soft_actor_critic, Adadelta, cuda, default: +12.74693%
  • soft_actor_critic, NAdam, cuda, foreach: +13.52852%
  • timm_nfnet, Adadelta, cuda, (pt2) default: -10.60020%
  • timm_nfnet, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.73666%
  • timm_nfnet, SGD, cuda, (pt2) foreach, momentum=0.9: +15.33856%
  • detectron2_fasterrcnn_r_101_c4, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.31830%
  • detectron2_fasterrcnn_r_101_c4, SGD, cuda, (pt2) foreach, momentum=0.9: +17.32255%
  • nvidia_deeprecommender, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +29.20597%
  • mobilenet_v3_large, Adadelta, cuda, (pt2) foreach: +12.27603%
  • mobilenet_v3_large, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +18.08728%
  • mobilenet_v3_large, SGD, cuda, (pt2) foreach, momentum=0.9: +18.16405%
  • hf_T5_large, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.09889%
  • hf_T5_large, SGD, cuda, (pt2) foreach, momentum=0.9: +16.95799%
  • hf_T5_large, RAdam, cuda, default: +11.08184%
  • hf_T5_large, RAdam, cuda, foreach: -10.12509%
  • hf_T5_large, NAdam, cuda, foreach: -11.92921%
  • hf_BigBird, Adadelta, cuda, (pt2) maximize: +13.72415%
  • hf_BigBird, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +21.02764%
  • hf_BigBird, SGD, cuda, (pt2) foreach, momentum=0.9: +15.23793%
  • mnasnet1_0, Adadelta, cuda, (pt2) foreach: -12.80595%
  • mnasnet1_0, ASGD, cuda, (pt2) maximize: -20.27974%
  • mnasnet1_0, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.77801%
  • mnasnet1_0, SGD, cuda, (pt2) foreach, momentum=0.9: +15.61177%
  • mnasnet1_0, Rprop, cuda, foreach: -10.25623%
  • densenet121, Adadelta, cuda, (pt2) default: +15.70140%
  • densenet121, Adamax, cuda, (pt2) default: -10.87436%
  • densenet121, ASGD, cuda, (pt2) default: -11.65889%
  • densenet121, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.07678%
  • densenet121, SGD, cuda, (pt2) foreach, momentum=0.9: +17.51475%
  • shufflenet_v2_x1_0, Adadelta, cuda, foreach: +10.39546%
  • shufflenet_v2_x1_0, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +15.40964%
  • shufflenet_v2_x1_0, SGD, cuda, (pt2) foreach, momentum=0.9: +18.38254%
  • shufflenet_v2_x1_0, Rprop, cuda, (pt2) maximize: -10.52315%
  • phlippe_densenet, Adagrad, cuda, maximize: -11.15475%
  • phlippe_densenet, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +13.67982%
  • phlippe_densenet, SGD, cuda, (pt2) foreach, momentum=0.9: +17.53691%
  • dcgan, Adam, cuda, (pt2) default: +10.05565%
  • dcgan, Adam, cuda, (pt2) foreach, maximize, capturable: +10.76761%
  • dcgan, Adam, cuda, (pt2) foreach, maximize, capturable, amsgrad: +15.04277%
  • dcgan, Adamax, cuda, (pt2) foreach: +25.90321%
  • dcgan, SGD, cuda, (pt2) foreach: +16.67097%
  • dcgan, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +77.15298%
  • basic_gnn_gin, Adagrad, cuda, (pt2) foreach: -10.46377%
  • basic_gnn_gin, Adam, cuda, (pt2) default: -11.44923%
  • basic_gnn_gin, Adam, cuda, (pt2) amsgrad, maximize: +12.51434%
  • resnet50_quantized_qat, Adadelta, cuda, (pt2) default: -11.85608%
  • resnet50_quantized_qat, Adadelta, cuda, (pt2) foreach: +12.84276%
  • resnet50_quantized_qat, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +18.31013%
  • resnet50_quantized_qat, SGD, cuda, (pt2) foreach, momentum=0.9: +19.50725%
  • resnet50_quantized_qat, Rprop, cuda, (pt2) default: -10.24632%
  • resnet50_quantized_qat, Rprop, cuda, (pt2) foreach: +17.93107%
  • resnet152, Adagrad, cuda, (pt2) foreach: -12.91312%
  • resnet152, ASGD, cuda, (pt2) default: +20.31393%
  • resnet152, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +18.03304%
  • resnet152, SGD, cuda, (pt2) foreach, momentum=0.9: +18.27350%
  • nanogpt_generate, Adam, cuda, (pt2) amsgrad, maximize: -10.01305%
  • nanogpt_generate, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +20.06935%
  • nanogpt_generate, SGD, cuda, (pt2) foreach, momentum=0.9: +17.22448%
  • nanogpt_generate, Rprop, cuda, foreach: +10.72682%
  • hf_T5, Adadelta, cuda, (pt2) maximize: +12.73955%
  • hf_T5, Adagrad, cuda, maximize: -15.67391%
  • hf_T5, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +19.05383%
  • hf_T5, SGD, cuda, (pt2) foreach, momentum=0.9: +18.82591%
  • hf_T5, Rprop, cuda, (pt2) foreach: +19.38245%
  • hf_T5, Rprop, cuda, foreach: -10.21894%
  • llama, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +15.80372%
  • llama, SGD, cuda, (pt2) foreach, momentum=0.9: +14.18009%
  • timm_vision_transformer, Adagrad, cuda, (pt2) foreach: +11.72334%
  • timm_vision_transformer, ASGD, cuda, (pt2) no_foreach: +12.45654%
  • timm_vision_transformer, ASGD, cuda, no_foreach: +15.29659%
  • timm_vision_transformer, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +19.41901%
  • timm_vision_transformer, SGD, cuda, (pt2) foreach, momentum=0.9: +15.89706%
  • timm_vision_transformer, Rprop, cuda, (pt2) default: +10.66033%
  • timm_vision_transformer, NAdam, cuda, (pt2) no_foreach: +12.80610%
  • yolov3, Adagrad, cuda, (pt2) foreach: -14.07733%
  • yolov3, Adam, cuda, (pt2) foreach, maximize, capturable, amsgrad: +18.02995%
  • yolov3, AdamW, cuda, default: +13.34917%
  • yolov3, AdamW, cuda, amsgrad, maximize: +12.14260%
  • yolov3, AdamW, cuda, (pt2) no_foreach: +20.35415%
  • yolov3, AdamW, cuda, (pt2) foreach, maximize, capturable: +12.58966%
  • yolov3, AdamW, cuda, (pt2) foreach, maximize, capturable, amsgrad: +10.14254%
  • yolov3, AdamW, cuda, foreach, maximize, capturable, amsgrad: +18.86200%
  • yolov3, AdamW, cuda, (pt2) fused, capturable, amsgrad: +14.30469%
  • yolov3, Adamax, cuda, (pt2) foreach: +11.27837%
  • yolov3, ASGD, cuda, (pt2) no_foreach: +24.09931%
  • yolov3, SGD, cuda, maximize: +14.66849%
  • yolov3, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +16.64589%
  • yolov3, SGD, cuda, (pt2) foreach, momentum=0.9: +19.77864%
  • hf_T5_base, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +20.31640%
  • hf_T5_base, SGD, cuda, (pt2) foreach, momentum=0.9: +18.69113%
  • pytorch_struct, Adagrad, cuda, (pt2) default: -14.34921%
  • pytorch_struct, Adam, cuda, (pt2) foreach, maximize, capturable, amsgrad: -13.08062%
  • hf_Bert_large, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +15.38457%
  • hf_Bert_large, SGD, cuda, (pt2) foreach, momentum=0.9: +16.36959%
  • detectron2_fasterrcnn_r_50_c4, Adamax, cuda, foreach: -12.99953%
  • detectron2_fasterrcnn_r_50_c4, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +56.32985%
  • detectron2_fasterrcnn_r_50_c4, SGD, cuda, (pt2) foreach, momentum=0.9: +55.49023%
  • demucs, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +71.03700%
  • demucs, SGD, cuda, (pt2) foreach, momentum=0.9: +49.39165%
  • pytorch_stargan, Adadelta, cuda, (pt2) maximize: +12.36882%
  • pytorch_stargan, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +18.39705%
  • pytorch_stargan, SGD, cuda, (pt2) foreach, momentum=0.9: +35.94717%
  • fambench_xlmr, AdamW, cuda, foreach: +10.03443%
  • fambench_xlmr, ASGD, cuda, foreach: -10.28177%
  • fambench_xlmr, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +16.77840%
  • fambench_xlmr, SGD, cuda, (pt2) foreach, momentum=0.9: +18.35859%
  • Super_SloMo, Adamax, cuda, (pt2) default: +26.57758%
  • Super_SloMo, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +16.90238%
  • Super_SloMo, SGD, cuda, (pt2) foreach, momentum=0.9: +13.43494%
  • Super_SloMo, Rprop, cuda, (pt2) default: -11.16735%
  • detectron2_maskrcnn_r_101_c4, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +23.79047%
  • detectron2_maskrcnn_r_101_c4, SGD, cuda, (pt2) foreach, momentum=0.9: +17.06438%
  • DALLE2_pytorch, ASGD, cuda, (pt2) maximize: -10.47660%
  • DALLE2_pytorch, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +16.90286%
  • DALLE2_pytorch, SGD, cuda, (pt2) foreach, momentum=0.9: +16.02788%
  • timm_vovnet, Adadelta, cuda, (pt2) maximize: +10.71432%
  • timm_vovnet, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +19.85040%
  • timm_vovnet, SGD, cuda, (pt2) foreach, momentum=0.9: +21.30767%
  • basic_gnn_gcn, Adadelta, cuda, (pt2) default: +15.59864%
  • basic_gnn_gcn, Adadelta, cuda, (pt2) maximize: +10.87873%
  • basic_gnn_gcn, Adagrad, cuda, (pt2) default: +12.28406%
  • basic_gnn_gcn, Adam, cuda, foreach: +13.15523%
  • basic_gnn_gcn, AdamW, cuda, (pt2) foreach: +12.00425%
  • basic_gnn_gcn, AdamW, cuda, foreach: +10.16019%
  • basic_gnn_gcn, SGD, cuda, (pt2) default: +37.00231%
  • detectron2_fasterrcnn_r_50_dc5, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +145.72344%
  • detectron2_fasterrcnn_r_50_dc5, SGD, cuda, (pt2) foreach, momentum=0.9: +98.52173%
  • resnet18, Adagrad, cuda, (pt2) maximize: -11.22009%
  • resnet18, Adam, cuda, (pt2) default: -12.51022%
  • resnet18, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +62.08407%
  • resnet18, SGD, cuda, (pt2) foreach, momentum=0.9: +47.31600%
  • basic_gnn_edgecnn, Adadelta, cuda, (pt2) foreach: +12.87523%
  • basic_gnn_edgecnn, Adam, cuda, (pt2) amsgrad, maximize: +14.25386%
  • basic_gnn_edgecnn, Adam, cuda, (pt2) foreach: -12.66487%
  • basic_gnn_edgecnn, Adamax, cuda, (pt2) foreach: -11.94945%
  • basic_gnn_edgecnn, Rprop, cuda, (pt2) maximize: -15.27691%
  • basic_gnn_edgecnn, Rprop, cuda, maximize: +13.75266%
  • fastNLP_Bert, ASGD, cuda, (pt2) default: +12.64274%
  • fastNLP_Bert, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +15.32005%
  • fastNLP_Bert, SGD, cuda, (pt2) foreach, momentum=0.9: +18.19237%
  • timm_efficientdet, Adadelta, cuda, (pt2) default: +10.70520%
  • timm_efficientdet, ASGD, cuda, (pt2) default: -12.04006%
  • timm_efficientdet, ASGD, cuda, (pt2) foreach: -13.99511%
  • timm_efficientdet, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +15.29808%
  • timm_efficientdet, SGD, cuda, (pt2) foreach, momentum=0.9: +15.42791%
  • timm_efficientdet, Rprop, cuda, (pt2) default: +30.52465%
  • drq, Adadelta, cuda, (pt2) maximize: -10.09427%
  • drq, Adadelta, cuda, foreach: -11.25980%
  • drq, Adagrad, cuda, foreach: +11.95062%
  • drq, Adam, cuda, (pt2) foreach: +11.88598%
  • drq, Adam, cuda, (pt2) foreach, maximize, capturable, amsgrad: -12.90203%
  • drq, Adam, cuda, (pt2) fused, amsgrad, maximize: -10.52199%
  • mobilenet_v2_quantized_qat, Adagrad, cuda, (pt2) default: +10.43355%
  • mobilenet_v2_quantized_qat, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +19.15766%
  • mobilenet_v2_quantized_qat, SGD, cuda, (pt2) foreach, momentum=0.9: +18.17385%
  • mobilenet_v2_quantized_qat, Rprop, cuda, default: -24.91401%
  • speech_transformer, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +16.65077%
  • speech_transformer, SGD, cuda, (pt2) foreach, momentum=0.9: +15.85889%
  • timm_resnest, Adadelta, cuda, (pt2) maximize: +15.89703%
  • timm_resnest, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +2959.28776%
  • timm_resnest, SGD, cuda, (pt2) foreach, momentum=0.9: +3186.72856%
  • attention_is_all_you_need_pytorch, Adadelta, cuda, (pt2) default: +11.73143%
  • attention_is_all_you_need_pytorch, Adagrad, cuda, (pt2) default: +10.04652%
  • attention_is_all_you_need_pytorch, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +19.00956%
  • attention_is_all_you_need_pytorch, SGD, cuda, (pt2) foreach, momentum=0.9: +18.41644%
  • hf_Longformer, Adadelta, cuda, (pt2) default: -10.64966%
  • hf_Longformer, Adagrad, cuda, (pt2) default: -11.89712%
  • hf_Longformer, ASGD, cuda, (pt2) maximize: +44.48423%
  • hf_Longformer, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +18.29870%
  • hf_Longformer, SGD, cuda, (pt2) foreach, momentum=0.9: +17.87722%
  • detectron2_maskrcnn_r_50_c4, Adadelta, cuda, (pt2) maximize: +10.54812%
  • detectron2_maskrcnn_r_50_c4, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +83.65549%
  • detectron2_maskrcnn_r_50_c4, SGD, cuda, (pt2) foreach, momentum=0.9: +95.24117%
  • doctr_det_predictor, ASGD, cuda, foreach: +15.18171%
  • doctr_det_predictor, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +18.81946%
  • doctr_det_predictor, SGD, cuda, (pt2) foreach, momentum=0.9: +18.12069%
  • timm_regnet, Adadelta, cuda, (pt2) default: +13.45302%
  • timm_regnet, Adadelta, cuda, (pt2) maximize: +18.32833%
  • timm_regnet, Adadelta, cuda, (pt2) foreach: +15.38946%
  • timm_regnet, ASGD, cuda, (pt2) foreach: +10.83181%
  • timm_regnet, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +18.00947%
  • timm_regnet, SGD, cuda, (pt2) foreach, momentum=0.9: +16.71488%
  • timm_regnet, RAdam, cuda, (pt2) foreach: +10.19439%
  • Background_Matting, Adamax, cuda, maximize: -26.85421%
  • Background_Matting, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +16.68096%
  • Background_Matting, SGD, cuda, (pt2) foreach, momentum=0.9: +18.20642%
  • detectron2_fasterrcnn_r_101_fpn, Adamax, cuda, (pt2) foreach: -12.18403%
  • detectron2_fasterrcnn_r_101_fpn, ASGD, cuda, (pt2) foreach: -16.78324%
  • detectron2_fasterrcnn_r_101_fpn, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +20.66252%
  • detectron2_fasterrcnn_r_101_fpn, SGD, cuda, (pt2) foreach, momentum=0.9: +17.53270%
  • basic_gnn_sage, Adam, cuda, (pt2) foreach, maximize, capturable: +14.15495%
  • basic_gnn_sage, AdamW, cuda, (pt2) foreach, maximize, capturable: +12.83798%
  • basic_gnn_sage, SGD, cuda, (pt2) foreach: +28.14492%
  • basic_gnn_sage, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +49.09567%
  • basic_gnn_sage, SGD, cuda, (pt2) foreach, momentum=0.9: +55.91548%
  • detectron2_fasterrcnn_r_50_fpn, Adam, cuda, (pt2) foreach: -11.22727%
  • detectron2_fasterrcnn_r_50_fpn, ASGD, cuda, default: -22.51279%
  • detectron2_fasterrcnn_r_50_fpn, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +312.85896%
  • detectron2_fasterrcnn_r_50_fpn, SGD, cuda, (pt2) foreach, momentum=0.9: +1624.69176%
  • detectron2_maskrcnn, Adagrad, cuda, (pt2) maximize: +11.33200%
  • detectron2_maskrcnn, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.37506%
  • detectron2_maskrcnn, SGD, cuda, (pt2) foreach, momentum=0.9: +16.19853%
  • resnext50_32x4d, Adam, cuda, (pt2) foreach: +12.23779%
  • resnext50_32x4d, ASGD, cuda, (pt2) default: -15.34465%
  • resnext50_32x4d, SGD, cuda, (pt2) maximize: -10.04745%
  • resnext50_32x4d, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +18.73010%
  • resnext50_32x4d, SGD, cuda, (pt2) foreach, momentum=0.9: +19.50472%
  • timm_vision_transformer_large, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.54019%
  • timm_vision_transformer_large, SGD, cuda, (pt2) foreach, momentum=0.9: +15.93584%
  • tts_angular, Adagrad, cuda, (pt2) foreach: +11.13386%
  • tts_angular, Adam, cuda, (pt2) foreach, maximize, capturable, amsgrad: +10.22103%
  • tacotron2, Adadelta, cuda, (pt2) default: -15.04458%
  • tacotron2, Adam, cuda, (pt2) amsgrad, maximize: -10.00476%
  • tacotron2, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +49.71003%
  • tacotron2, SGD, cuda, (pt2) foreach, momentum=0.9: +58.32263%
  • opacus_cifar10, Adagrad, cuda, (pt2) maximize: +17.61479%
  • opacus_cifar10, Adagrad, cuda, (pt2) foreach: -17.63331%
  • opacus_cifar10, Adam, cuda, foreach, maximize, capturable, amsgrad: +10.13257%
  • opacus_cifar10, Adamax, cuda, default: +10.06345%
  • opacus_cifar10, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +64.13250%
  • opacus_cifar10, SGD, cuda, (pt2) foreach, momentum=0.9: +61.58718%
  • hf_Bert, Adagrad, cuda, (pt2) foreach: +15.29684%
  • hf_Bert, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +21.63244%
  • hf_Bert, SGD, cuda, (pt2) foreach, momentum=0.9: +21.51703%
  • hf_Bert, Rprop, cuda, foreach: -16.58714%
  • hf_T5_generate, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +20.99306%
  • hf_T5_generate, SGD, cuda, (pt2) foreach, momentum=0.9: +17.90938%
  • BERT_pytorch, Adadelta, cuda, (pt2) default: -11.70155%
  • BERT_pytorch, Adagrad, cuda, no_foreach: -13.71321%
  • BERT_pytorch, AdamW, cuda, (pt2) differentiable: +10.12920%
  • BERT_pytorch, AdamW, cuda, foreach, maximize, capturable, amsgrad: -10.23028%
  • BERT_pytorch, ASGD, cuda, (pt2) maximize: +13.91033%
  • BERT_pytorch, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +18.65085%
  • BERT_pytorch, SGD, cuda, (pt2) foreach, momentum=0.9: +14.59606%
  • BERT_pytorch, Rprop, cuda, (pt2) default: +10.78876%
  • torchrec_dlrm, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +26.43883%
  • torchrec_dlrm, SGD, cuda, (pt2) foreach, momentum=0.9: +32.61242%
  • cm3leon_generate, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +16.55265%
  • cm3leon_generate, SGD, cuda, (pt2) foreach, momentum=0.9: +18.47517%
  • maml, Rprop, cuda, (pt2) default: -16.25165%
  • detectron2_maskrcnn_r_50_fpn, Adam, cuda, (pt2) amsgrad, maximize: -10.21379%
  • detectron2_maskrcnn_r_50_fpn, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.51091%
  • detectron2_maskrcnn_r_50_fpn, SGD, cuda, (pt2) foreach, momentum=0.9: +16.54998%
  • detectron2_maskrcnn_r_50_fpn, Rprop, cuda, default: +13.67955%
  • detectron2_maskrcnn_r_50_fpn, Rprop, cuda, (pt2) maximize: +23.99300%
  • detectron2_maskrcnn_r_50_fpn, Rprop, cuda, foreach: -11.81693%
  • LearningToPaint, Adadelta, cuda, (pt2) maximize: -12.92135%
  • LearningToPaint, Adam, cuda, (pt2) foreach: +15.96174%
  • LearningToPaint, Adamax, cuda, (pt2) maximize: +11.54910%
  • LearningToPaint, Adamax, cuda, (pt2) foreach: +19.05312%
  • LearningToPaint, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +69.58901%
  • LearningToPaint, SGD, cuda, (pt2) foreach, momentum=0.9: +62.37140%
  • LearningToPaint, Rprop, cuda, foreach: -13.53516%
  • resnet50, Adadelta, cuda, (pt2) default: +11.84840%
  • resnet50, Adadelta, cuda, (pt2) no_foreach: +11.48196%
  • resnet50, Adagrad, cuda, (pt2) maximize: -13.25943%
  • resnet50, Adagrad, cuda, differentiable: +12.50403%
  • resnet50, Adam, cuda, (pt2) foreach, maximize, capturable, amsgrad: +14.92478%
  • resnet50, Adamax, cuda, (pt2) no_foreach: -12.12456%
  • resnet50, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.92886%
  • resnet50, SGD, cuda, (pt2) foreach, momentum=0.9: +16.78283%
  • resnet50, NAdam, cuda, (pt2) no_foreach: +10.97016%
  • resnet50, NAdam, cuda, (pt2) foreach: +13.65118%
  • doctr_reco_predictor, Adam, cuda, (pt2) foreach, maximize, capturable, amsgrad: +12.59885%
  • doctr_reco_predictor, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +253.56239%
  • doctr_reco_predictor, SGD, cuda, (pt2) foreach, momentum=0.9: +204.08757%
  • functorch_maml_omniglot, Adagrad, cuda, (pt2) maximize: +12.33350%
  • functorch_maml_omniglot, Adam, cuda, foreach, maximize, capturable, amsgrad: +14.00212%
  • functorch_maml_omniglot, Adam, cuda, (pt2) fused, amsgrad, maximize: -12.20215%
  • functorch_maml_omniglot, Adamax, cuda, (pt2) default: +12.51839%
  • functorch_maml_omniglot, ASGD, cuda, (pt2) maximize: -11.54243%
  • functorch_dp_cifar10, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +49.68793%
  • functorch_dp_cifar10, SGD, cuda, (pt2) foreach, momentum=0.9: +47.60016%
  • functorch_dp_cifar10, Rprop, cuda, (pt2) foreach: -24.75489%
  • moco, Adagrad, cuda, (pt2) default: -10.33141%
  • moco, Adagrad, cuda, maximize: -13.74868%
  • moco, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.10291%
  • moco, SGD, cuda, (pt2) foreach, momentum=0.9: +16.64791%
  • phlippe_resnet, Adam, cuda, (pt2) amsgrad, maximize: -10.10035%
  • phlippe_resnet, Adam, cuda, (pt2) foreach, maximize, capturable, amsgrad: +14.53721%
  • phlippe_resnet, ASGD, cuda, (pt2) maximize: +28.94919%
  • phlippe_resnet, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +47.50348%
  • phlippe_resnet, SGD, cuda, (pt2) foreach, momentum=0.9: +46.22181%
  • hf_GPT2_large, AdamW, cuda, foreach, maximize, capturable: -10.06231%
  • hf_GPT2_large, Adamax, cuda, (pt2) differentiable: +16.29530%
  • hf_GPT2_large, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +16.75150%
  • hf_GPT2_large, SGD, cuda, (pt2) foreach, momentum=0.9: +17.52965%
  • timm_efficientnet, Adagrad, cuda, default: -11.81640%
  • timm_efficientnet, Adam, cuda, (pt2) foreach, maximize, capturable, amsgrad: -11.45832%
  • timm_efficientnet, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.68977%
  • timm_efficientnet, SGD, cuda, (pt2) foreach, momentum=0.9: +15.59806%
  • timm_efficientnet, Rprop, cuda, (pt2) maximize: -14.43210%
  • lennard_jones, Adadelta, cuda, (pt2) default: +15.69300%
  • lennard_jones, Adadelta, cuda, (pt2) maximize: +10.75832%
  • lennard_jones, Adadelta, cuda, (pt2) foreach: +10.09663%
  • lennard_jones, Adagrad, cuda, (pt2) default: +10.87385%
  • lennard_jones, Adam, cuda, (pt2) foreach, maximize, capturable: -12.97829%
  • lennard_jones, SGD, cuda, (pt2) foreach: +24.94495%
  • lennard_jones, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +42.76188%
  • lennard_jones, SGD, cuda, (pt2) foreach, momentum=0.9: +45.20561%
  • pytorch_unet, Adagrad, cuda, default: -10.19017%
  • pytorch_unet, Adam, cuda, (pt2) foreach, maximize, capturable, amsgrad: -11.57650%
  • pytorch_unet, ASGD, cuda, default: -10.79926%
  • pytorch_unet, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +540.09837%
  • pytorch_unet, SGD, cuda, (pt2) foreach, momentum=0.9: +141.36992%
  • pytorch_unet, Rprop, cuda, (pt2) default: -12.80526%
  • pytorch_unet, Rprop, cuda, (pt2) foreach: +10.35265%
  • pytorch_CycleGAN_and_pix2pix, Adadelta, cuda, (pt2) maximize: -12.08851%
  • pytorch_CycleGAN_and_pix2pix, Adagrad, cuda, (pt2) default: -11.39100%
  • pytorch_CycleGAN_and_pix2pix, Adagrad, cuda, (pt2) foreach: -11.33181%
  • pytorch_CycleGAN_and_pix2pix, Adamax, cuda, (pt2) foreach: -16.21176%
  • pytorch_CycleGAN_and_pix2pix, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +25.50535%
  • pytorch_CycleGAN_and_pix2pix, SGD, cuda, (pt2) foreach, momentum=0.9: +19.08719%
  • hf_Bart, SGD, cuda, (pt2) foreach, momentum=0.9, nesterov: +17.02888%
  • hf_Bart, SGD, cuda, (pt2) foreach, momentum=0.9: +17.66717%
  • hf_Bart, Rprop, cuda, (pt2) foreach: -12.48494%
  • hf_Albert, AdamW, cuda, (pt2) foreach, maximize, capturable, amsgrad: -10.36349%
  • hf_Albert, SGD, cuda, (pt2) foreach, momentum=0.9: +10.59709%

Tests that were no longer run on affected commit:

Tests that were newly added on affected commit:

Runtime regressions found?
No runtime errors were found in the new benchmarks run--you are all good there!

GitHub workflow that triggered this issue: https://github.com/pytorch/benchmark/actions/runs/5020694796

cc @janeyx99

@janeyx99
Contributor

Total: 361
speedup: 96 | slowdown: 265
(pt2): 311 | eager: 50
<20%: 291 | >=20%: 70
SGD: 159 ...
SGD with momentum: 152

doctr_reco_predictor, SGD, cuda, (pt2) foreach, momentum=0.9: +204.08757%
[image]
There seem to be significant slowdowns for SGD in general, especially with momentum.
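
For anyone who wants to redo a tally like the one above from the bot's "Affected Tests" list, here is a minimal sketch. It assumes each entry has the shape `model, optimizer, device, config: ±N%` as in the report, that a positive delta is a slowdown (inferred from the SGD discussion here, not confirmed by the bot), and that the 20% threshold applies to the absolute value. `parse_line` and `tally` are hypothetical helper names, not part of TorchBench.

```python
import re
from collections import Counter

# Entry shape assumed from the bot report, e.g.
# "hf_GPT2, SGD, cuda, (pt2) foreach, momentum=0.9: +13.43381%"
LINE_RE = re.compile(
    r"^\s*(?:•\s*)?(?P<model>[^,]+), (?P<optim>[^,]+), (?P<device>[^,]+), "
    r"(?P<config>.+): (?P<delta>[+-]\d+(?:\.\d+)?)%\s*$"
)


def parse_line(line):
    """Parse one report entry into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    config = m.group("config")
    return {
        "model": m.group("model").strip(),
        "optim": m.group("optim").strip(),
        "pt2": config.startswith("(pt2)"),
        "momentum": "momentum=" in config,
        "delta": float(m.group("delta")),
    }


def tally(lines):
    """Count entries by direction, mode, magnitude, and SGD usage."""
    counts = Counter()
    for rec in filter(None, map(parse_line, lines)):
        counts["total"] += 1
        # Assumption: positive delta means the benchmark got slower.
        counts["slowdown" if rec["delta"] > 0 else "speedup"] += 1
        counts["pt2" if rec["pt2"] else "eager"] += 1
        # Assumption: the threshold is on the absolute change.
        counts[">=20%" if abs(rec["delta"]) >= 20 else "<20%"] += 1
        if rec["optim"] == "SGD":
            counts["SGD"] += 1
            if rec["momentum"]:
                counts["SGD with momentum"] += 1
    return counts
```

Feeding it the bullet lines of the report (with or without the leading `•`) returns a `Counter` keyed by the same labels used in the summary above.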

@janeyx99
Contributor

@mlazos was looking into this

@janeyx99 reopened this May 22, 2023
@janeyx99
Contributor

Didn't mean to close

@janeyx99
Contributor

The candidate commits that went into torch between the two commits are:
[image]
and
[image]

Not sure which one could have affected the momentum buffers in SGD.
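
For context on what the momentum buffers do, here is a minimal pure-Python sketch of one SGD step with momentum. It follows the update rule as I understand it from the `torch.optim.SGD` documentation, assuming dampening of 0 and no weight decay. It is not the foreach implementation and says nothing about how PT2 traces it; `sgd_momentum_step` is a hypothetical name used only for illustration.

```python
def sgd_momentum_step(params, grads, bufs, lr, momentum=0.9, nesterov=False):
    """Apply one SGD-with-momentum step to parallel lists of floats.

    bufs holds one momentum buffer per parameter; None means "not created yet".
    Returns the updated parameters and the updated buffers.
    """
    new_params, new_bufs = [], []
    for p, g, buf in zip(params, grads, bufs):
        # On the first step the buffer is initialized to the gradient;
        # afterwards it accumulates: buf = momentum * buf + grad.
        buf = g if buf is None else momentum * buf + g
        # Nesterov looks ahead along the buffer; plain momentum uses the buffer.
        step = g + momentum * buf if nesterov else buf
        new_params.append(p - lr * step)
        new_bufs.append(buf)
    return new_params, new_bufs


# Two consecutive steps on a single scalar parameter.
params, bufs = [1.0], [None]
params, bufs = sgd_momentum_step(params, [0.5], bufs, lr=0.1)
params, bufs = sgd_momentum_step(params, [0.5], bufs, lr=0.1)
```

The point of the sketch is that momentum adds one extra state tensor per parameter that is read and written on every step, which is the state the comment above is asking about.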

@janeyx99
Contributor

From playing with Scuba, I've isolated the following:

  • The regression is only for SGD with momentum on PT2. No momentum = no regression. No PT2 = no regression.
  • SGD suffered across the board, for all models, though llama was a big victim.
  • The regression happened between 5/18 and 5/19, between the commits 900ca4d and f66d5dd, not including the first (900ca4d).
    [image]
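
The slicing described above (PT2 vs. eager, momentum vs. no momentum) can be sketched as follows. This is only an illustration of the grouping, not the Scuba query that was actually run. The rows are a handful of SGD entries copied from the bot report, far too few to support any conclusion on their own, and `sgd_breakdown` is a hypothetical helper.

```python
from statistics import median


def sgd_breakdown(rows):
    """Group SGD rows by (mode, momentum) and report (count, median delta %)."""
    groups = {}
    for optim, config, delta in rows:
        if optim != "SGD":
            continue
        key = (
            "pt2" if config.startswith("(pt2)") else "eager",
            "momentum" if "momentum=" in config else "no momentum",
        )
        groups.setdefault(key, []).append(delta)
    return {key: (len(vals), median(vals)) for key, vals in groups.items()}


# (optimizer, config, percent change) copied from the report above.
rows = [
    ("SGD", "(pt2) foreach, momentum=0.9", 14.18009),            # llama
    ("SGD", "(pt2) foreach, momentum=0.9, nesterov", 15.80372),  # llama
    ("SGD", "(pt2) foreach", 16.67097),                          # dcgan
    ("SGD", "maximize", 14.66849),                               # yolov3
    ("SGD", "(pt2) maximize", -10.04745),                        # resnext50_32x4d
]

for key, (count, med) in sorted(sgd_breakdown(rows).items()):
    print(key, count, round(med, 2))
```

A single CI report is noisy (note the scattered ±10% entries for other optimizers), so a breakdown like this is more meaningful over many runs than over one.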

@janeyx99 changed the title from "V2 Performance Signal Detected by TorchBench CI on '329bb2a33e40f4bc76b2e061b180d3234984c91b'" to "SGD foreach with momentum x PT2 regression" on Jun 13, 2023