
dynamo evals the subfunctions of a skipped frame with the callback set, causing bad performance and more errors #128928

Closed
Ma-Jian1 opened this issue Jun 18, 2024 · 6 comments


@Ma-Jian1
Contributor

Ma-Jian1 commented Jun 18, 2024

🐛 Describe the bug

torch.compile tries to split the graph at the first unsupported piece of code, compiles everything before the break point, stitches the compiled and non-compiled code together, and then runs it all with the frame-evaluation callback still set. The callback then tries to recompile the unsupported function (there may be other supported code nested under it), but it does not skip the code it already knows is unsupported. This hurts performance and raises additional errors.

Error logs

No response

Minified repro

 import torch

 m = torch.nn.SiLU()

 class A:
     def myfunc():
         pass

 def break_graph3(t):
     funcs = [A.myfunc, m]
     gen = ("SiLU" in f.__class__.__name__ for f in funcs)  # "in" is not fully supported
     a = all(gen)  # generators are not fully supported

 def toy_example(t):
     t = t + 1
     break_graph3(t)
     t = t + 3
     return t

 fn = torch.compile(toy_example)
 print(fn(torch.randn(1)))
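For reference, outside of dynamo the generator in break_graph3 is ordinary eager Python. A torch-free sketch of the same pattern (DummySiLU and plain_func are hypothetical stand-ins for m and A.myfunc) shows what all(...) actually computes:

```python
# Hypothetical stand-ins for the repro's objects; no torch required.
class DummySiLU:
    pass

def plain_func():
    pass

funcs = [plain_func, DummySiLU()]

# The same generator expression dynamo breaks on: a membership
# test over each object's class name.
gen = ("SiLU" in f.__class__.__name__ for f in funcs)

# plain_func's class is named 'function', so the first element is
# False and all() short-circuits.
result = all(gen)
print(result)  # → False
```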

Versions

Collecting environment information...
PyTorch version: 2.2.2a0+gitb11808e
Is debug build: True
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.5 (ssh:https://gerrit.habana-labs.com:29418/tpc_llvm10 40a69d3611a3941b828718e8d803ea1cfb724976)
CMake version: version 3.28.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-107-generic-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
CPU family: 6
Model: 85
Thread(s) per core: 1
Core(s) per socket: 6
Socket(s): 2
Stepping: 0
BogoMIPS: 5187.81
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
Virtualization: VT-x
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 12 MiB (12 instances)
L3 cache: 38.5 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-11
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX flush not necessary, SMT disabled
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Syscall hardening, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] flake8==7.0.0
[pip3] habana-torch-dataloader==1.17.0+git84e273963
[pip3] habana-torch-plugin==1.17.0+git84e273963
[pip3] intel-extension-for-pytorch==2.2.0
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.5
[pip3] pytorch-lightning==2.2.0.post0
[pip3] torch==2.2.2a0+gitb11808e
[pip3] torch-debug==2.2.0a0+git10d65c0
[pip3] torch_tb_profiler==0.4.0
[pip3] torchaudio==2.2.0+08901ad
[pip3] torchdata==0.7.1+5e6f7b7
[pip3] torchmetrics==1.3.2
[pip3] torchtext==0.17.0+400da5c
[pip3] torchvision==0.17.0+b2383d4
[pip3] triton==2.2.0
[conda] Could not collect

cc @ezyang @anijain2305 @chauhang

@masnesral
Contributor

@Ma-Jian1, sorry, what is the problem exactly? Is it just that dynamo does not currently support this generator? If so, I believe this is a known shortcoming. I found this issue with more details: #93737

@Ma-Jian1
Contributor Author

Not quite; I have corrected the example.
In general, when dynamo finds that it does not support this generator, it splits the graph and compiles everything before the split point.
All of that is fine.
But it then runs all of the code with the callback set, and the callback tries to recompile the unsupported code, e.g. the generator here.
Maybe it should run the generator without the callback?

@Ma-Jian1
Contributor Author

Ma-Jian1 commented Jun 20, 2024

I'm not sure whether this generator comes from cg.make_function_with_closure or whether it is the original generator.

@masnesral
Contributor

@Ma-Jian1, sorry, there's still something wrong with the example: name 'A' is not defined

@Ma-Jian1
Contributor Author

@masnesral sorry for that, updated.

@Ma-Jian1
Contributor Author

I think I now understand the whole story, and it is still related to the generator.
In the first pass, dynamo just creates the generator and then hits the "in" operator.
In the later pass, it actually calls the generator and throws unimplemented("generator").
Maybe because this has little impact on performance, dynamo does not wrap the generator call specifically.
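That two-step behaviour matches plain Python generator semantics: building a generator executes none of its body; only consuming it does. A torch-free sketch:

```python
log = []

def make_gen():
    # Creating the generator ("first pass") runs nothing inside it.
    # log.append(...) returns None, so `or name` makes each element yield name.
    return (log.append(name) or name for name in ["SiLU", "ReLU"])

gen = make_gen()
assert log == []           # nothing has executed yet

first = next(gen)          # only consumption ("the later pass") runs the body
assert first == "SiLU"
assert log == ["SiLU"]
```

So dynamo only discovers the unsupported generator body when the stitched-together code actually iterates it, not when the generator object is created.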
