
[After Rebase] Top of Traceable FSDP2 stack #128996

Open · wants to merge 25 commits into base: gh/yf225/46/base

Conversation

[ghstack-poisoned]

pytorch-bot bot commented Jun 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128996

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 30 New Failures, 3 Unrelated Failures

As of commit cf6e033 with merge base 734891a:

NEW FAILURES - The following jobs have failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot added labels Jun 18, 2024: ciflow/inductor, module: dynamo, module: inductor, oncall: distributed, release notes: distributed (fsdp)
yf225 pushed a commit that referenced this pull request Jun 18, 2024
ghstack-source-id: 7cd520d3e76cca56d9e7ee60ec0b12cabf7c2cee
Pull Request resolved: #128996
@yf225 changed the title from Top of Traceable FSDP2 stack to [After Rebase] Top of Traceable FSDP2 stack, Jun 18, 2024
name,
**options,
)
return self.side_effects.track_object_existing(target, vt)
yf225 (Contributor, Author): We shouldn't need this if we are tracing into the inbuilt nn module.

else:
proxy = mod.__class__.__new__(mod.__class__)
proxy.__dict__ = mod.__dict__
return proxy
yf225 (Contributor, Author): We shouldn't need this anymore.

yf225 (Contributor, Author), Jun 18, 2024: But maybe we need some changes to existing code (like the places that are still using nn_module_proxy())?
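For context, the snippet under discussion uses a common shallow-proxy pattern. A minimal standalone sketch (illustrative names, not the PyTorch source): allocate a new instance without running `__init__`, then share the original object's attribute dict, so attribute reads and writes on the proxy are visible on the original.

```python
class Module:
    def __init__(self, name):
        self.name = name

def make_proxy(mod):
    proxy = mod.__class__.__new__(mod.__class__)  # allocate without calling __init__
    proxy.__dict__ = mod.__dict__                 # share the attribute storage
    return proxy

m = Module("linear")
p = make_proxy(m)
p.name = "renamed"
print(m.name)  # → renamed  (writes through the proxy hit the original)
```

Because `__dict__` is shared rather than copied, the proxy is a distinct object of the same class that aliases the original's state, which is why removing it requires auditing every remaining call site.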

value,
name,
source=self.get_source(),
)
yf225 (Contributor, Author): We shouldn't rely on tx.output.nn_modules.

@@ -222,6 +222,9 @@ def _custom_getattr_fallback(self, base, tx, name, options):
if not isinstance(getattr_fn, types.FunctionType):
unimplemented("torch.nn.Module with a non-function custom __getattr__")

if getattr(base, "_is_fsdp_managed_module", False):
from .builder import VariableBuilder
return VariableBuilder(tx, options["source"])(getattr_fn(base, name))
yf225 (Contributor, Author): We probably don't need this.
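A plain-Python sketch of the fallback logic in the hunk above, with hypothetical names (not Dynamo's API): look up a user-defined `__getattr__` on the class, verify it is a plain function, and for FSDP-managed modules call it directly to resolve the attribute (the real code then wraps the result with VariableBuilder).

```python
import types

class Managed:
    _is_fsdp_managed_module = True

    def __init__(self):
        self._extra = {"weight_scale": 2.0}

    def __getattr__(self, name):
        # Only invoked when normal attribute lookup fails.
        try:
            return self._extra[name]
        except KeyError:
            raise AttributeError(name)

def custom_getattr_fallback(base, name):
    getattr_fn = getattr(type(base), "__getattr__", None)
    if not isinstance(getattr_fn, types.FunctionType):
        raise NotImplementedError("non-function custom __getattr__")
    if getattr(base, "_is_fsdp_managed_module", False):
        return getattr_fn(base, name)  # real code wraps this via VariableBuilder
    raise AttributeError(name)

print(custom_getattr_fallback(Managed(), "weight_scale"))  # → 2.0
```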

@@ -317,6 +320,9 @@ def var_getattr(self, tx, name):
elif is_safe_constant(subobj) or istensor(subobj):
# Support possibly common cases of class members
return VariableBuilder(tx, NNModuleSource(source))(subobj)
elif istype(subobj, types.GetSetDescriptorType):
assert source
return VariableBuilder(tx, source)(subobj.__get__(base))
yf225 (Contributor, Author): Copy this to UnspecializedNNModuleVariable.
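An illustrative example (not PyTorch code) of the attribute kind the new branch handles: a getset descriptor is a C-level property, so resolving it requires calling `__get__` with the instance before the value can be wrapped.

```python
import types

# Functions carry a C-level getset descriptor for __code__.
descr = types.FunctionType.__dict__["__code__"]
assert isinstance(descr, types.GetSetDescriptorType)

def f():
    return 42

code = descr.__get__(f)        # equivalent to f.__code__
print(code is f.__code__)      # → True
```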

yf225 (Contributor, Author): It would be good to know for which function it happened; then we can send a separate PR for that.

assert_const=False,
)
)
)
yf225 (Contributor, Author): We don't need these.

_named_embed,
tx=tx,
key=key,
source_cls=FSDPNNModuleSource,
yf225 (Contributor, Author), Jun 18, 2024: We don't need all of these, but we do need a different way to propagate FSDPNNModuleSource. This is mostly perf-only and shouldn't affect functionality, so we can do the propagation work after the aot_eager unit test lands.

return super().var_getattr(tx, name)

def as_python_constant(self):
return self.value
yf225 (Contributor, Author): It's not really a constant, so as_python_constant shouldn't just return self.value.

return variables.LambdaVariable(
lambda *args, **kwargs: self.call_method(tx, name, args, kwargs)
)
return super().var_getattr(tx, name)
yf225 (Contributor, Author): This can be removed.
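For readers unfamiliar with the pattern in the snippet above, a minimal sketch with hypothetical names: instead of returning a bound method object, var_getattr returns a lambda that forwards to call_method, deferring method dispatch to the call site.

```python
class Tracker:
    def call_method(self, name, args, kwargs):
        # Stand-in for real dispatch; just report what was requested.
        return f"{name}({args}, {kwargs})"

    def var_getattr(self, name):
        # Return a callable that forwards to call_method at call time.
        return lambda *args, **kwargs: self.call_method(name, args, kwargs)

t = Tracker()
fn = t.var_getattr("forward")
print(fn(1, k=2))  # → forward((1,), {'k': 2})
```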

def call_function(
self, tx, args: "List[VariableTracker]", kwargs: "Dict[str, VariableTracker]"
) -> "VariableTracker":
return super().call_function(tx, args, kwargs)
yf225 (Contributor, Author): This can be removed; it's a pure pass-through to the superclass.

source=NNModuleSource(_gen_source(source, name)),
),
]
)
yf225 (Contributor, Author): We can put them back.

# TODO(yf225): this is a workaround to allow inplace fully-sharded module to
# still go into this branch (instead of the second branch).
# If we don't do this, `torch.compile(fully_shard(module_from_user_defined_module_class))` will ignore all module hooks which will break FSDP tracing.
# But, is this the right way to support it?
yf225 (Contributor, Author), Jun 18, 2024: Animesh says this is probably fine.

yf225 pushed a commit that referenced this pull request Jun 19, 2024
ghstack-source-id: 1c7cb0a6be2af209d3475dc16af259cbea9c39cc
Pull Request resolved: #128996
yf225 pushed a commit that referenced this pull request Jun 20, 2024
ghstack-source-id: c2b26eeafd0c37107459f7b875cccd7b283954b5
Pull Request resolved: #128996
@albanD albanD removed their request for review June 20, 2024 20:24