
Prevent cuda:0 context initialization when working on another cuda device #124722

Closed · wants to merge 1 commit from the inductor-multigpu-cuda-alloc branch

Conversation

@vfdev-5 (Collaborator) commented on Apr 23, 2024

Description

Issue description: when a user works on the "cuda:1" device and compiles a model, a CUDA context is also initialized on device "cuda:0", which can surprise the user who then sees device 0 being utilized in nvidia-smi.

Reproduction code:

import torch
from torchvision.models import resnet18

def print_memory_usage():
    # Sum the cumulative allocated, inactive (non-released) and reserved byte
    # counters reported by the CUDA caching allocator for devices 0 and 1.
    for d in [0, 1]:
        stats = torch.cuda.memory_stats(device=d)
        m = stats["allocated_bytes.all.allocated"] + stats["inactive_split_bytes.all.allocated"] + stats["reserved_bytes.all.allocated"]
        print(f"\t- CUDA Device: {d}, allocated + reserved + non-released in MB: {m / 1024 / 1024}")

device = "cuda:1"
model = resnet18()
compiled_model = torch.compile(model)

print("--- Before compiled model to device")
print_memory_usage()

compiled_model.to(device)
x = torch.rand(16, 3, 320, 320, device=device)

print("--- Before compiled model forward")
print_memory_usage()

y = compiled_model(x)

print("--- Before compiled model backward")
print_memory_usage()

y.sum().backward()

print("--- After compiled model backward")
print_memory_usage()

Output:

--- Before compiled model to device
        - CUDA Device: 0, allocated + reserved + non-released in MB: 0.0
        - CUDA Device: 1, allocated + reserved + non-released in MB: 0.0
--- Before compiled model forward
        - CUDA Device: 0, allocated + reserved + non-released in MB: 0.0
        - CUDA Device: 1, allocated + reserved + non-released in MB: 192.966796875
--- Before compiled model backward
        - CUDA Device: 0, allocated + reserved + non-released in MB: 8.044921875    # <--- this should be zero
        - CUDA Device: 1, allocated + reserved + non-released in MB: 2054.27197265625
--- After compiled model backward
        - CUDA Device: 0, allocated + reserved + non-released in MB: 8.044921875    # <--- this should be zero
        - CUDA Device: 1, allocated + reserved + non-released in MB: 5654.61962890625

This PR fixes the CUDA context initialization (init_cuda_context) performed during FakeTensor creation, lazy_init and pattern registrations, so that device cuda:0 no longer gets a context when the work runs on another device. With the fix, the same script prints:

--- Before compiled model to device
        - CUDA Device: 0, allocated + reserved + non-released in MB: 0.0
        - CUDA Device: 1, allocated + reserved + non-released in MB: 0.0
--- Before compiled model forward
        - CUDA Device: 0, allocated + reserved + non-released in MB: 0.0
        - CUDA Device: 1, allocated + reserved + non-released in MB: 192.966796875
--- Before compiled model backward
        - CUDA Device: 0, allocated + reserved + non-released in MB: 0.0
        - CUDA Device: 1, allocated + reserved + non-released in MB: 2054.31982421875
--- After compiled model backward
        - CUDA Device: 0, allocated + reserved + non-released in MB: 0.0
        - CUDA Device: 1, allocated + reserved + non-released in MB: 5654.66748046875
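
For illustration only, a minimal sketch of the idea (this is not the PR's actual diff, and touch_cuda_context is a hypothetical helper): derive the device set from the example inputs and initialize CUDA state under a device guard, so no context is created on cuda:0 as a side effect.

import torch

def touch_cuda_context(example_inputs):
    # Hypothetical helper: only initialize CUDA state on the devices that
    # actually appear in the inputs, never on the default device (cuda:0).
    devices = {t.device for t in example_inputs if isinstance(t, torch.Tensor) and t.is_cuda}
    for dev in devices:
        # torch.cuda.device(...) restores the previous current device on exit,
        # so only `dev`'s primary context is created here.
        with torch.cuda.device(dev):
            torch.zeros(1, device=dev)  # a small allocation forces context creation on `dev`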
  • Fix the issue
  • Add tests

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang


pytorch-bot bot commented Apr 23, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/124722

Note: Links to docs will display an error until the docs builds have been completed.

❌ 36 New Failures, 22 Unrelated Failures

As of commit 26a2ef7 with merge base bad8d25:


This comment was automatically generated by Dr. CI and updates every 15 minutes.

@vfdev-5 force-pushed the inductor-multigpu-cuda-alloc branch from 34abcd5 to 26a2ef7 on Apr 23, 2024 at 13:32
@@ -1094,7 +1094,13 @@ def fw_compiler_freezing(
    from torch._inductor.freezing import convert_conv_weights_to_channels_last, freeze

    # partition_fn won't be called
    _recursive_joint_graph_passes(aot_autograd_model)
    inputs_devices = list(
        {i.device for i in pytree.tree_flatten(aot_example_inputs)[0]}
    )
@vfdev-5 (Collaborator, Author) commented:

Here, I should avoid fetching the device on non-tensor inputs.
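
For example (an illustrative tweak, not the committed code), the comprehension could skip non-tensor leaves:

inputs_devices = list(
    {i.device for i in pytree.tree_flatten(aot_example_inputs)[0]
     if isinstance(i, torch.Tensor)}
)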


Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jun 23, 2024
@github-actions github-actions bot closed this Jul 23, 2024