Add a handle_buffers option for EMAHandler #2592

Merged (6 commits) on Jun 15, 2022

Conversation

@sandylaker (Contributor) commented Jun 6, 2022

Fixes #2590

Description:

Add a handle_buffers option for EMAHandler.

cc @vfdev-5 @RicherMans

Check list:

  • New tests are added (if a new feature is added)
  • New doc strings: description and/or example code are in RST format
  • Documentation is updated (if required)

The github-actions bot added the module: handlers (Core Handlers module) label on Jun 6, 2022.
@vfdev-5 (Collaborator) commented Jun 6, 2022

@sandylaker thanks for the quick PR!

As for the problem with integral params in buffers, does it mean that PyTorch should have a similar issue here: https://github.com/pytorch/pytorch/blob/080cf84bed46c6c118c37fc2fa6fbd484fd9b4cd/torch/optim/swa_utils.py#L124-L132 ?

@RicherMans commented:

@vfdev-5

> As for the problem with integral params in buffers, does it mean that PyTorch should have a similar issue here: https://github.com/pytorch/pytorch/blob/080cf84bed46c6c118c37fc2fa6fbd484fd9b4cd/torch/optim/swa_utils.py#L124-L132 ?

Yeah, it should, but the default usage for the SWA model is to update the batch norm statistics after training, not during training, via https://github.com/pytorch/pytorch/blob/080cf84bed46c6c118c37fc2fa6fbd484fd9b4cd/torch/optim/swa_utils.py#L136 .
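
For reference, a minimal sketch of that post-training refresh using torch.optim.swa_utils.update_bn; the model and loader below are toy stand-ins, not anything from the baseline:

import torch
from torch import nn
from torch.optim.swa_utils import AveragedModel, update_bn
from torch.utils.data import DataLoader, TensorDataset

# Toy online model with a BatchNorm layer and an averaged (SWA/EMA-style) copy of it.
model = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8), nn.ReLU())
swa_model = AveragedModel(model)

# One pass over the training data recomputes running_mean/running_var of every
# _BatchNorm module in swa_model once training has finished.
loader = DataLoader(TensorDataset(torch.randn(64, 8)), batch_size=16)
update_bn(loader, swa_model)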

I would also hypothesize that most code and examples set the model to .train() before the transfer, so the problem is less likely to occur.
Also (at least for me), it's not that the model throws exceptions or is completely "useless", but rather that it converges to a less optimal solution due to the "wrong" means and variances within the batch norm.

Actually, I once trained a common mean-teacher recipe and checked the absolute difference between the online (student) model and the EMA model.
For a lower layer like the first batch norm, the differences in mean/variance are negligible, usually in the range < 1e-1.
But for higher layers, the differences become very noticeable, roughly ±2 for the means and ±200 for the variances.

@sandylaker (Contributor, Author) commented Jun 6, 2022

@vfdev-5 I have not tested torch.optim.swa_utils.AveragedModel yet, but I did a simple experiment as follows:

>>> import torch
>>> num_batches_tracked = torch.tensor(0, dtype=torch.int64)
>>> num_batches_tracked.copy_(num_batches_tracked * 0.9998 + 1 * 0.0002)
tensor(0)

I think the rounding error will be more severe when the current value is small, since the fractional EMA increment gets truncated away when it is copied back into the integer buffer.
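
For comparison (toy values, continuing the snippet above, just to illustrate the point), the same update on a floating-point buffer keeps the fractional increment:

>>> running_mean = torch.tensor(0.0)
>>> running_mean.copy_(running_mean * 0.9998 + 1.0 * 0.0002)
tensor(0.0002)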

@RicherMans

In the current implementation, the running_mean and running_var in a _BatchNorm will be updated with momentum, just like the other params, but the num_batches_tracked will be copied directly. Do you think the performance degradation issue will still be present? Could you please test it in your codebase? Thanks a lot.
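
A minimal sketch of the behaviour described above, assuming the EMA and online models share the same architecture so their buffers line up; this is an illustration, not the exact EMAHandler code:

import torch

def update_ema_buffers(ema_model, online_model, momentum=0.0002):
    for ema_b, online_b in zip(ema_model.buffers(), online_model.buffers()):
        if torch.is_floating_point(ema_b):
            # Momentum/EMA update: ema = (1 - momentum) * ema + momentum * online
            ema_b.mul_(1.0 - momentum).add_(online_b, alpha=momentum)
        else:
            # Integer buffers such as num_batches_tracked are copied directly.
            ema_b.copy_(online_b)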

@vfdev-5 (Collaborator) commented Jun 6, 2022

@sandylaker if you check the "Solutions" section of #2590 again, it looks like RicherMans already tried updating the floating-point buffers and copying the integer ones (see his code snippet).
It looks like the 2nd option in the solutions is better than the 1st.

For this PR I think we could add an arg with 3 options: 1) "copy" = keep the current (copy) behaviour, 2) "update_buffers" = update the buffers according to RicherMans' option 1, and 3) "ema_train" = RicherMans' option 2.

What do you think?
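
For illustration, a rough usage sketch of the proposed argument; the option value shown assumes the naming that ended up in the merged code ("copy", "update", "ema_train"), and the model and trainer below are toy stand-ins:

from torch import nn
from ignite.engine import Engine
from ignite.handlers import EMAHandler

# Toy model and trainer just to show where the option plugs in.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))

def train_step(engine, batch):
    return 0.0  # real training logic omitted

trainer = Engine(train_step)

# handle_buffers: "copy" keeps the current copy behaviour, "update" applies the
# EMA update to the buffers as well, and "ema_train" skips buffer updates and
# keeps the EMA model in .train() mode.
ema_handler = EMAHandler(model, momentum=0.0002, handle_buffers="update")
ema_handler.attach(trainer, name="ema_momentum")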

@RicherMans commented:

@sandylaker

> In the current implementation, the running_mean and running_var in a _BatchNorm will be updated with momentum, just like the other params, but the num_batches_tracked will be copied directly. Do you think the performance degradation issue will still be present? Could you please test it in your codebase? Thanks a lot.

I actually already checked all the proposed methods during my investigations. To be more precise about my experiments: I do sound event classification for the DCASE 2022 task, where the baseline uses a mean-teacher approach that I have reimplemented using EMAHandler.
For the baseline, they report two values (PSDS-1 and PSDS-2 scores) for the experiment; higher is better.

The results are as follows:

| Model | PSDS-1 | PSDS-2 |
| --- | --- | --- |
| Baseline | 33.6 | 53.60 |
| EMAHandler v0.4.9 (old) | 16.95 | 30.86 |
| use_buffers=True (new) | 27.53 | 43.99 |
| Set model to .train() and do not update buffers (suggestion 2) | 33.9 | 54.84 |

So as I originally stated, the momentum update of buffers such as the mean and variance obviously works better than the simple copying mechanism.
However, all results pale in comparison to simply letting the model compute its own mean and variance.

Of course this is a mean-teacher learning case, and my results might not be representative of other cases where one might want to "freeze" the buffers.

What do you guys think?

@vfdev-5 (Collaborator) commented Jun 6, 2022

@RicherMans thanks for sharing the details!

I think we could provide all 3 options so that, depending on the use case, the user can pick the appropriate configuration (as suggested above: #2592 (comment)). We have to ensure that the 3rd option, ema_model.train(), does not break any assumptions in the flow.

@vfdev-5 (Collaborator) commented Jun 11, 2022

@sandylaker can we move forward with this PR and implement the 3 options described in #2592 (comment)? Thank you.

@sandylaker (Contributor, Author) commented:

@vfdev-5 Hi, sorry for the late reply; I have been busy during the work week. I will implement it this weekend.

@sandylaker changed the title from "Add a use_buffers option for EMAHandler" to "Add a handle_buffers option for EMAHandler" on Jun 14, 2022.

@vfdev-5 (Collaborator) left a review comment:

Looks great, thanks a lot @sandylaker!

@vfdev-5 (Collaborator) commented Jun 14, 2022

@RicherMans could you please check whether this PR solves the issue you reported?

@vfdev-5 merged commit b7f3d60 into pytorch:master on Jun 15, 2022.

@vfdev-5 (Collaborator) commented Jun 15, 2022

I merged this PR; if we need more changes, we can make them in a follow-up PR.
Thanks again @sandylaker!

@RicherMans commented:

Sorry @vfdev-5 for my late response.
Yeah, that commit looks pretty good; I tested it and it looks alright!

Thanks guys for the good work! (Even though I hope I can contribute a commit myself one day.)

@vfdev-5 (Collaborator) commented Jun 15, 2022

> Even though I hope I can contribute a commit myself one day

That would be awesome! By the way, if you need some help getting started with that, feel free to join our Discord: https://pytorch-ignite.ai/chat

Labels: module: handlers (Core Handlers module)

Successfully merging this pull request may close the following issue: EMAHandler (Buffer update) does lead to bad results for Mean-Teacher experiments (#2590)