add TORCH_FORCE_SYNCHRONOUS_COLLECTIVES to force functional collectives to be synchronous #128331

Open

wants to merge 11 commits into base: gh/bdhirsh/578/base

Conversation

bdhirsh
Contributor

@bdhirsh commented Jun 10, 2024

One way that OOMs can show up when using torch.compile with distributed code is if compile fails to properly issue a wait_tensor on a functional collective.

This PR adds an env var that forces all functional collectives to be synchronous at runtime (making wait_tensor a no-op at runtime, so if we fail to issue wait_tensors properly at runtime we will not leak memory).

If we see a memory leak / OOM when using compile with distributed code, this flag should at least quickly tell us whether the leak is coming from not waiting on a collective, vs. something else entirely.
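For illustration, here is a minimal sketch (not code from this PR) of the failure mode and of how the flag would be used to triage it; the launch command, backend, and tensor shapes are arbitrary assumptions, and the env var name follows this PR (a rename is discussed further down):

# Sketch only: a functional collective whose wait_tensor the compiler is expected to emit.
# Run under torchrun with 2+ ranks, e.g.:
#   TORCH_FORCE_SYNCHRONOUS_COLLECTIVES=1 torchrun --nproc_per_node=2 repro.py
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

dist.init_process_group("gloo")  # or "nccl" for GPU jobs
rank = dist.get_rank()

x = torch.ones(4) * (rank + 1)
y = funcol.all_reduce(x, "sum", dist.group.WORLD)  # async functional collective

# Correct programs wait before using the result. If compile drops this wait,
# pending collectives (and their buffers) accumulate and can eventually OOM.
# With the env var set, the collective above already completed synchronously,
# so a dropped wait no longer leaks memory; if the OOM disappears under the
# flag, the leak was an unwaited collective rather than something else.
y = funcol.wait_tensor(y)
print(f"rank {rank}: {y}")

dist.destroy_process_group()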

cc @yifuwang / @wconstab, wdyt of this env var?

Stack from ghstack (oldest at bottom):

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k

pytorch-bot bot commented Jun 10, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128331

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

✅ No Failures

As of commit 2b4a770 with merge base 81df076:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ezyang
Contributor

ezyang commented Jun 10, 2024

Doesn't have to be in this PR, but you should think about where this should be doc'ed and how people who need to know about it can find out about it.

@bdhirsh
Contributor Author

bdhirsh commented Jun 10, 2024

Yep, very fair (@drisspg - per that BE project you were looking into, do we have a nice way for users to discover TORCH_* env vars already? Otherwise, if/once we agree that this env var is reasonable, I can look for a place in the docs to mention it)

// Useful for debugging memory leaks with compile that surface
// due to compile not properly waiting on functional collectives.
bool force_synchronous_functional_collectives() {
  static char const* temp = getenv("TORCH_FORCE_SYNCHRONOUS_COLLECTIVES");
  return temp != nullptr; // presence check only; see review comments below
}
Collaborator

Use std::getenv. Also what if the user wants to explicitly disable it?

Contributor Author

I'll probably just switch to using the same APIs to read env vars as the other distributed code: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L780

@wconstab
Contributor

I like the idea of adding the env var, but I'm not sure how to name it.

Most of the distributed env vars are named TORCH_NCCL_*, and @shuqiang recently documented them all in this page; you should probably add an entry there and also in the compile docs?

The 2 confusing points regarding naming and behavior are:

  • it's not NCCL-specific, so probably TORCH_ is better
  • the env var only affects functional collectives, which might confuse users of non-functional collectives

TORCH_FUNCOL_SYNC?

@bdhirsh
Contributor Author

bdhirsh commented Jun 10, 2024

The 2 confusing points regarding naming and behavior are

@wconstab thanks for the feedback. The point about most env vars including NCCL in their name, while this one is not NCCL-specific, is fair. TORCH_FUNCOL_SYNC sounds reasonable to me; I can change it to that.

@yifuwang
Contributor

This looks great! Thanks for adding it.

Curious what the most common sources of unwaited collectives are, based on your observation. Is it more due to graph break boundary handling, or users calling the _c10d_functional ops without calling wait? For the latter, we could add a pass to lint/fix the program.
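As a rough illustration of what such a lint pass could look like (a hedged sketch, not part of this PR): walk an FX graph and flag _c10d_functional collective calls whose result is never passed to wait_tensor. The op list and the direct-user matching rule below are simplifying assumptions, not an exhaustive linter.

import torch
from torch.fx import GraphModule

# Illustrative subset; the _c10d_functional namespace exposes more collectives.
_COLLECTIVE_OPS = {
    torch.ops._c10d_functional.all_reduce.default,
    torch.ops._c10d_functional.all_gather_into_tensor.default,
    torch.ops._c10d_functional.reduce_scatter_tensor.default,
}
_WAIT_OP = torch.ops._c10d_functional.wait_tensor.default


def find_unwaited_collectives(gm: GraphModule):
    """Return collective call nodes with no wait_tensor among their direct users."""
    unwaited = []
    for node in gm.graph.nodes:
        if node.op == "call_function" and node.target in _COLLECTIVE_OPS:
            has_wait = any(
                user.op == "call_function" and user.target == _WAIT_OP
                for user in node.users
            )
            if not has_wait:
                unwaited.append(node)
    return unwaited

A real pass would also need to follow views/aliases before declaring a collective unwaited, and could either error out or insert the missing wait_tensor itself.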

bdhirsh added a commit that referenced this pull request Jun 18, 2024
…es to be synchronous

ghstack-source-id: 8ccdf3e19024b408169d0e62013fd6652ead3da0
Pull Request resolved: #128331
bdhirsh added a commit that referenced this pull request Jun 27, 2024
…es to be synchronous

ghstack-source-id: 05cb6a20a36fbcfb2f561fdb573a27cd86333b81
Pull Request resolved: #128331
bdhirsh added a commit that referenced this pull request Jun 28, 2024
…es to be synchronous

ghstack-source-id: 3989292f029ad2b1dafde3e9d64224974b0ee9eb
Pull Request resolved: #128331
bdhirsh added a commit that referenced this pull request Jul 9, 2024
…es to be synchronous

ghstack-source-id: 02eb26078cbbeb66bad9d86eb0e9e1f5a4af3f08
Pull Request resolved: #128331
@albanD albanD removed their request for review July 9, 2024 19:14
bdhirsh added a commit that referenced this pull request Jul 9, 2024
…es to be synchronous

ghstack-source-id: b1478dec3b435ed68197712d98b9425b3b1f489a
Pull Request resolved: #128331
bdhirsh added a commit that referenced this pull request Jul 10, 2024
…es to be synchronous

ghstack-source-id: 3d87fe806ae26969f31ed74f5671573a0e15a588
Pull Request resolved: #128331
Labels
oncall: distributed (Add this issue/PR to distributed oncall triage queue)
release notes: distributed (c10d) (release notes category)
Projects
None yet

6 participants