add TORCH_FORCE_SYNCHRONOUS_COLLECTIVES to force functional collectives to be synchronous #128331
base: gh/bdhirsh/578/base
Conversation
add TORCH_FORCE_SYNCHRONOUS_COLLECTIVES to force functional collectives to be synchronous [ghstack-poisoned]
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128331
Note: links to docs will display an error until the doc builds have completed. ❗ 1 active SEV: there is 1 currently active SEV; if your PR is affected, please view it on the HUD. ✅ No failures as of commit 2b4a770 with merge base 81df076. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Doesn't have to be in this PR, but you should think about where this should be doc'ed and how people who need to know about it can find out about it.
yep, very fair (@drisspg - per that BE project you were looking into, do we have a nice way for users to find TORCH_* env vars already? Otherwise maybe if/once we agree that this env var is reasonable, I can look for a place in the docs to mention it)
// Useful for debugging memory leaks with compile that surface
// due to compile not properly waiting on functional collectives.
bool force_synchronous_functional_collectives() {
  static char const* temp = getenv("TORCH_FORCE_SYNCHRONOUS_COLLECTIVES");
Use std::getenv. Also what if the user wants to explicitly disable it?
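For illustration, a minimal self-contained sketch of how the check could honor an explicit opt-out (this is an assumption about desired semantics, not what the PR currently implements): unset means off, "0" means explicitly disabled, anything else enables it.

#include <cstdlib>
#include <cstring>

// Sketch only: cache the result once, since the env var cannot
// meaningfully change after process startup.
bool force_synchronous_functional_collectives() {
  static const bool value = [] {
    const char* env = std::getenv("TORCH_FORCE_SYNCHRONOUS_COLLECTIVES");
    if (env == nullptr) {
      return false; // default: keep functional collectives asynchronous
    }
    return std::strcmp(env, "0") != 0; // "0" is an explicit opt-out
  }();
  return value;
}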
I'll probably just switch to using the same APIs to read env vars as the other distributed code: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp#L780
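If the check is routed through those helpers, it might reduce to something like the following (the getCvarBool helper name and signature are assumed from how ProcessGroupNCCL reads its TORCH_NCCL_* vars; treat this as a sketch, not verified code):

#include <torch/csrc/distributed/c10d/Utils.hpp>

// Hypothetical: reuse the c10d-style env helper (name/signature assumed),
// which takes a list of accepted variable names plus a default value.
bool force_synchronous_functional_collectives() {
  static const bool value = c10d::getCvarBool(
      {"TORCH_FORCE_SYNCHRONOUS_COLLECTIVES"}, /*def=*/false);
  return value;
}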
I like the idea of adding the env var, but I'm not sure how to name it. Most of the distributed env vars are named […]. The two confusing points regarding naming and behavior are […]
TORCH_FUNCOL_SYNC?
@wconstab thanks for the feedback. The point about most env vars including […]
This looks great! Thanks for adding it. Curious what the most common sources of unwaited collectives are, based on your observation. Is it more due to graph break boundary handling, or users calling the […]?
One way that OOMs can show up when using torch.compile with distributed code is if compile fails to issue a `wait_tensor` properly on a functional collective. This PR adds an env var that forces all functional collectives to be synchronous at runtime (making `wait_tensor` a no-op at runtime, so if we fail to issue `wait_tensor` calls properly at runtime then we will not leak memory).
If we see a mem leak / OOM when using compile with distributed code, this flag should at least quickly tell us if the mem leak is coming from not waiting on a collective, vs. something else entirely.
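To make the semantics concrete, here is a self-contained toy analogy using std::future (the env var name is the real one; everything else is invented for illustration and is not this PR's code): issuing work normally returns an unwaited handle, and the flag makes the issue step complete the work eagerly, so a forgotten wait downstream can no longer leave pending work (or, in the real case, an unreleased buffer) behind.

#include <cstdlib>
#include <cstring>
#include <future>
#include <iostream>

// Toy stand-in for the flag check (see the sketch earlier in the thread).
bool force_sync() {
  const char* env = std::getenv("TORCH_FORCE_SYNCHRONOUS_COLLECTIVES");
  return env != nullptr && std::strcmp(env, "0") != 0;
}

// "Issue" an async operation; when the flag is set, finish it eagerly so
// the user-visible wait becomes a no-op.
std::shared_future<int> issue_collective(int value) {
  auto fut =
      std::async(std::launch::async, [value] { return value * 2; }).share();
  if (force_sync()) {
    fut.wait(); // complete at issue time
  }
  return fut;
}

int main() {
  auto handle = issue_collective(21);
  // With TORCH_FORCE_SYNCHRONOUS_COLLECTIVES=1 the work is already done
  // here, even if the caller forgot to wait before using the result.
  std::cout << handle.get() << "\n"; // get() still waits if pending
  return 0;
}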
cc @yifuwang / @wconstab, wdyt of this env var?
Stack from ghstack (oldest at bottom):
cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k