add TORCH_FORCE_SYNCHRONOUS_COLLECTIVES to force functional collectives to be synchronous #128331

Open

wants to merge 11 commits into base: gh/bdhirsh/578/base

Conversation

bdhirsh
Contributor

@bdhirsh commented Jun 10, 2024

One way that OOMs can show up when using torch.compile with distributed code is if compile fails to properly issue a wait_tensor on a functional collective.

This PR adds an env var that forces all functional collectives to be synchronous at runtime (making wait_tensor a no-op at runtime, so if we fail to issue wait_tensors properly at runtime we will not leak memory).

If we see a memory leak / OOM when using compile with distributed code, this flag should at least quickly tell us whether the leak is coming from not waiting on a collective, vs. something else entirely.
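For illustration, here is a minimal sketch (not code from this PR) of the failure mode and of how the flag would be used to triage it; the launch command, backend, and tensor shapes are arbitrary assumptions, and the env var name follows this PR (a rename is discussed further down):

# Sketch only: a functional collective whose wait_tensor the compiler is expected to emit.
# Run under torchrun with 2+ ranks, e.g.:
#   TORCH_FORCE_SYNCHRONOUS_COLLECTIVES=1 torchrun --nproc_per_node=2 repro.py
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

dist.init_process_group("gloo")  # or "nccl" for GPU jobs
rank = dist.get_rank()

x = torch.ones(4) * (rank + 1)
y = funcol.all_reduce(x, "sum", dist.group.WORLD)  # async functional collective

# Correct programs wait before using the result. If compile drops this wait,
# pending collectives (and their buffers) accumulate and can eventually OOM.
# With the env var set, the collective above already completed synchronously,
# so a dropped wait no longer leaks memory; if the OOM disappears under the
# flag, the leak was an unwaited collective rather than something else.
y = funcol.wait_tensor(y)
print(f"rank {rank}: {y}")

dist.destroy_process_group()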

cc @yifuwang / @wconstab, wdyt of this env var?

Stack from ghstack (oldest at bottom):

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

pytorch-bot bot commented Jun 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128331

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 2b4a770 with merge base 81df076:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ezyang
Contributor

ezyang commented Jun 10, 2024

Doesn't have to be in this PR, but you should think about where this should be doc'ed and how people who need to know about it can find out about it.

@bdhirsh
Contributor Author

bdhirsh commented Jun 10, 2024

Yep, very fair (@drisspg - per that BE project you were looking into, do we have a nice way for users to discover TORCH_* env vars already? Otherwise, if/once we agree that this env var is reasonable, I can look for a place in the docs to mention it)

// Useful for debugging memory leaks with compile that surface
// due to compile not properly waiting on functional collectives.
bool force_synchronous_functional_collectives() {
  static char const* temp = getenv("TORCH_FORCE_SYNCHRONOUS_COLLECTIVES");
  return temp != nullptr; // presence check only; see review comments below
}
Collaborator

Use std::getenv. Also what if the user wants to explicitly disable it?

Contributor Author

I'll probably just switch to using the same APIs to read env vars as the other distributed code: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L780

@wconstab
Contributor

I like the idea of adding the env var, but I'm not sure how to name it.

Most of the distributed env vars are named TORCH_NCCL_*, and @shuqiang recently documented them all in this page; you should probably add an entry there and also in the compile docs?

The 2 confusing points regarding naming and behavior are:

  • it's not NCCL-specific, so probably TORCH_ is better
  • the env var only affects functional collectives, which might confuse users of non-functional collectives

TORCH_FUNCOL_SYNC?

@bdhirsh
Contributor Author

bdhirsh commented Jun 10, 2024

The 2 confusing points regarding naming and behavior are

@wconstab thanks for the feedback. The point about most env vars including NCCL in their name, while this one is not NCCL-specific, is fair. TORCH_FUNCOL_SYNC sounds reasonable to me; I can change it to that.

@yifuwang
Contributor

This looks great! Thanks for adding it.

Curious what the most common sources of unwaited collectives are, based on your observation. Is it more due to graph break boundary handling, or users calling the _c10d_functional ops without calling wait? For the latter, we could add a pass to lint/fix the program.
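As a rough illustration of what such a lint pass could look like (a hedged sketch, not part of this PR): walk an FX graph and flag _c10d_functional collective calls whose result is never passed to wait_tensor. The op list and the direct-user matching rule below are simplifying assumptions, not an exhaustive linter.

import torch
from torch.fx import GraphModule

# Illustrative subset; the _c10d_functional namespace exposes more collectives.
_COLLECTIVE_OPS = {
    torch.ops._c10d_functional.all_reduce.default,
    torch.ops._c10d_functional.all_gather_into_tensor.default,
    torch.ops._c10d_functional.reduce_scatter_tensor.default,
}
_WAIT_OP = torch.ops._c10d_functional.wait_tensor.default


def find_unwaited_collectives(gm: GraphModule):
    """Return collective call nodes with no wait_tensor among their direct users."""
    unwaited = []
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target in _COLLECTIVE_OPS:
            has_wait = any(
                user.op == "call_function" and user.target == _WAIT_OP
                for user in node.users
            )
            if not has_wait:
                unwaited.append(node)
    return unwaited

A real pass would also need to follow views/aliases before declaring a collective unwaited, and could either error out or insert the missing wait_tensor itself.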

bdhirsh added a commit that referenced this pull request Jun 18, 2024
…es to be synchronous

ghstack-source-id: 8ccdf3e19024b408169d0e62013fd6652ead3da0
Pull Request resolved: #128331
bdhirsh added a commit that referenced this pull request Jun 27, 2024
…es to be synchronous

ghstack-source-id: 05cb6a20a36fbcfb2f561fdb573a27cd86333b81
Pull Request resolved: #128331
bdhirsh added a commit that referenced this pull request Jun 28, 2024
…es to be synchronous

ghstack-source-id: 3989292f029ad2b1dafde3e9d64224974b0ee9eb
Pull Request resolved: #128331
bdhirsh added a commit that referenced this pull request Jul 9, 2024
…es to be synchronous

ghstack-source-id: 02eb26078cbbeb66bad9d86eb0e9e1f5a4af3f08
Pull Request resolved: #128331
@albanD albanD removed their request for review July 9, 2024 19:14
bdhirsh added a commit that referenced this pull request Jul 9, 2024
…es to be synchronous

ghstack-source-id: b1478dec3b435ed68197712d98b9425b3b1f489a
Pull Request resolved: #128331
bdhirsh added a commit that referenced this pull request Jul 10, 2024
…es to be synchronous

ghstack-source-id: 3d87fe806ae26969f31ed74f5671573a0e15a588
Pull Request resolved: #128331
Labels
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
release notes: distributed (c10d) (release notes category)
Projects
None yet

6 participants