Hello!
When I used tools/dist_train.sh to train voxelpose, I found that across single-GPU, dual-GPU, and 8-GPU DDP training, the more GPUs I use, the worse the results get. I then tried turning on the autoscale_lr option, but the results got even worse :(
In the config I use lr=0.0001 (consistent with voxelpose) and batch_size=8.
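For context, this is the linear lr scaling rule I assume autoscale_lr applies; I have not checked the mmpose implementation, and the base effective batch size below is my own assumption:

```python
# Sketch of the linear lr scaling rule I assume autoscale_lr follows;
# the base effective batch size of 8 (1 GPU x batch_size=8) is my assumption,
# not something I verified in the mmpose code.
base_lr = 0.0001        # lr in my config, same as the voxelpose config
samples_per_gpu = 8     # assuming batch_size=8 in my config is per GPU
base_batch_size = 8     # effective batch size base_lr was presumably tuned for

def scaled_lr(num_gpus: int) -> float:
    """Scale lr linearly with the total (effective) batch size."""
    return base_lr * (num_gpus * samples_per_gpu) / base_batch_size

for n in (1, 2, 8):
    print(n, scaled_lr(n))   # 1 -> 0.0001, 2 -> 0.0002, 8 -> 0.0008
```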
1. Is there something else I may have overlooked that causes this problem?
2. After reading some blog posts, I found that computing BN statistics independently on each GPU may be one reason for the accuracy drop with multiple GPUs. In mmpose, does BN synchronization during distributed training need to be set manually, or is it done automatically? (See the sketch after this list for what I mean by setting it manually.)
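By "manually" I mean something like converting the model's BN layers before wrapping it with DDP, using PyTorch's built-in converter; whether and where mmpose already does this internally is exactly what I'm asking. The model below is just a placeholder:

```python
# Hypothetical manual SyncBN setup before DDP, only to illustrate question 2;
# I don't know whether mmpose already does this somewhere internally.
import torch.nn as nn

model = nn.Sequential(nn.Conv3d(3, 16, 3), nn.BatchNorm3d(16))  # placeholder model

# Replace every BatchNorm*d with SyncBatchNorm (a process group must be
# initialized at training time; dist_train.sh handles that part).
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])
```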
[Update for question 2] I tried selecting SyncBN instead of BN3d when calling build_norm_layer, and the accuracy has partly recovered, but it is still not as good as the single-GPU result.
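Concretely, the change was along these lines (mmcv's build_norm_layer; the channel count is just an example, and the exact place in the voxelpose model where the norm_cfg is consumed may differ):

```python
# Minimal sketch of swapping BN3d for SyncBN via mmcv's build_norm_layer.
from mmcv.cnn import build_norm_layer

num_features = 32  # example channel count

# Before: per-GPU batch statistics (nn.BatchNorm3d under the hood).
_, bn3d = build_norm_layer(dict(type='BN3d'), num_features)

# After: statistics synchronized across GPUs (nn.SyncBatchNorm; needs
# torch.distributed to be initialized during training, which dist_train.sh does).
_, sync_bn = build_norm_layer(dict(type='SyncBN'), num_features)

print(bn3d)
print(sync_bn)
```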