
Refactor EpochMetric and make it idempotent #2800

Conversation

sadra-barikbin commented Dec 11, 2022

After adding an assertion that checks recomputing the metric does not change its result, `_test_distrib_integration` fails.
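
For reference, the idempotency check boils down to something like this (a minimal, non-distributed sketch; the actual assertion lives in `_test_distrib_integration`, and the `compute_fn` here is only illustrative):

```python
import torch

from ignite.metrics import EpochMetric

# Illustrative compute_fn: mean squared error over all accumulated samples.
metric = EpochMetric(lambda y_pred, y: torch.mean((y_pred - y) ** 2).item())
metric.update((torch.rand(8), torch.rand(8)))

first = metric.compute()
second = metric.compute()  # recompute without any new updates
assert first == second, "recomputing the metric changed its result"
```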

vfdev-5 commented Dec 11, 2022

Seems like there is a bug here:

```python
if ws > 1 and not self._is_reduced:
    # All gather across all processes
    _prediction_tensor = cast(torch.Tensor, idist.all_gather(_prediction_tensor))
    _target_tensor = cast(torch.Tensor, idist.all_gather(_target_tensor))
    self._is_reduced = True
```

`_prediction_tensor` and `_target_tensor` are temporary tensors, but `self._is_reduced` is set to `True`, mirroring other metrics where the `@sync_all_reduce` decorator updates the metric's attributes in place and the flag avoids duplicated all-reduce calls. Here the gathered tensors are never stored back on the metric, so once the flag is set, a later `compute()` skips the all-gather and operates on the local tensors only, which is why recomputing changes the result.
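
One way to fix it (a sketch of the idea, not necessarily the exact change merged in this PR) is to keep the gathered tensors local to every `compute()` call and drop the flag update, so that recomputation repeats the gather:

```python
from typing import cast

import torch

import ignite.distributed as idist
from ignite.exceptions import NotComputableError

# Sketch of an idempotent EpochMetric.compute: the gathered tensors are
# plain local variables, so no state flag is flipped and calling compute()
# twice yields the same result.
def compute(self) -> float:
    if len(self._predictions) < 1 or len(self._targets) < 1:
        raise NotComputableError(
            "EpochMetric must have at least one example before it can be computed."
        )

    _prediction_tensor = torch.cat(self._predictions, dim=0)
    _target_tensor = torch.cat(self._targets, dim=0)

    ws = idist.get_world_size()
    if ws > 1:
        # All gather across all processes; results stay local to this call.
        _prediction_tensor = cast(torch.Tensor, idist.all_gather(_prediction_tensor))
        _target_tensor = cast(torch.Tensor, idist.all_gather(_target_tensor))

    result = 0.0
    if idist.get_rank() == 0:
        # Run compute_fn on rank zero only and broadcast to the others.
        result = self.compute_fn(_prediction_tensor, _target_tensor)
    if ws > 1:
        result = cast(float, idist.broadcast(result, src=0))
    return result
```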

Thanks for catching that, @sadra-barikbin!

sadra-barikbin force-pushed the Refactor-EpochMetric-and-make-it-idempotent branch from d3779e8 to 1f05f23 on February 17, 2023 13:57
sadra-barikbin force-pushed the Refactor-EpochMetric-and-make-it-idempotent branch from 10d7542 to af77ead on February 17, 2023 15:13

vfdev-5 left a comment

LGTM, thanks @sadra-barikbin

vfdev-5 enabled auto-merge (squash) on February 17, 2023 16:55
vfdev-5 merged commit 5d8d6bf into pytorch:master on Feb 17, 2023