
torch.compile with reduce-overhead: very long compile time + GPU memory continuously growing #128424

Open
ydshieh opened this issue Jun 11, 2024 · 7 comments
Labels
module: cuda graphs Ability to capture and then replay streams of CUDA kernels module: dynamic shapes oncall: pt2 triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@ydshieh

ydshieh commented Jun 11, 2024

🐛 Describe the bug

For the code snippet, see the end of this report:

  • model.py: very simple
  • run.py: a bit more complex, but only in order to measure memory and timing

In short: torch.compile with reduce-overhead takes a long time and GPU memory usage keeps growing (during the second call for each shape seen).

Although the code snippet is a dummy, the same situation occurs when I test with Llama or Gemma models from HuggingFace.

This can be seen clearly in the output sections below. A few notes to help read the code snippet and the outputs:

  • Think of each (outer) iteration as one (language-model) call to generate (running up to max_len steps); a minimal sketch of this pattern follows the question below
  • Think of each (inner) iteration as one (language-model) decoding step (which calls the model's forward)
  • The 1st (outer) iteration (i.e. the first call to generate) sees all possible input shapes: time taken and memory usage don't vary much
    • for max_len=1024
      • timing: 7.76404
      • Used GPU memory increased: 54.0 MB.
    • for max_len=2048
      • timing: 9.206949
      • Used GPU memory increased: 54.0 MB.
    • for max_len=4096
      • timing: 12.44839
      • Used GPU memory increased: 54.0 MB.
  • The 2nd (outer) iteration: takes much longer and memory usage accumulates (see the outputs with information from intermediate steps below)
    • for max_len=1024
      • timing: 115.606751
      • Used GPU memory increased: 150.0 MB.
    • for max_len=2048
      • timing: 232.851245
      • Used GPU memory increased: 302.0 MB.
    • for max_len=4096
      • timing: 565.084438
      • Used GPU memory increased: 606.0 MB.
  • The 3rd (outer) iteration: very fast and no memory issue
    • for max_len=1024
      • timing: 1.045248
      • Used GPU memory increased: 0.0 MB.
    • for max_len=2048
      • timing: 2.026303
      • Used GPU memory increased: 0.0 MB.
    • for max_len=4096
      • timing: 3.927233
      • Used GPU memory increased: 0.0 MB.

Question: the slowness and memory accumulation in the 2nd iteration (especially with such a tiny model here) make torch.compile impractical for this use case, which seems to be a common one.
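
To make the terminology concrete, here is a minimal sketch of the generate/decode pattern (editor's illustration, not the measurement script; run.py below is what produced the numbers above). The attention mask's last dimension grows with the step index, so every decode step presents a new shape to the compiled forward.

import torch
from model import MyModel_2

model = MyModel_2(max_len=2048, dim=16).to("cuda")
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

with torch.no_grad():
    for call_idx in range(3):        # "outer" iteration: one generate() call
        for step in range(2048):     # "inner" iteration: one decoding step
            input_ids = torch.tensor([[step]], dtype=torch.int32, device="cuda")
            # the mask width (step + 1) changes every step -> a new dynamic shape
            attn_mask = torch.ones(1, 1, step + 1, dtype=torch.bool, device="cuda")
            _ = model(input_ids, attn_mask)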

outputs (brief)

1024 steps (per iteration)
================================================================================
max_len: 1024
iter_idx: 0

Used GPU memory: 120.9375 MB.

Used GPU memory: 174.9375 MB.

timing: 7.76404
max_memory_allocated increased: 8.2001953125 MB.
Used GPU memory increased: 54.0 MB.
------------------------------------------------------------
max_len: 1024
iter_idx: 1

Used GPU memory: 174.9375 MB.

Used GPU memory: 324.9375 MB.

timing: 115.606751
max_memory_allocated increased: 1.05615234375 MB.
Used GPU memory increased: 150.0 MB.
------------------------------------------------------------
max_len: 1024
iter_idx: 2

Used GPU memory: 324.9375 MB.

Used GPU memory: 324.9375 MB.

timing: 1.045248
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.
------------------------------------------------------------
max_len: 1024
iter_idx: 3

Used GPU memory: 324.9375 MB.

Used GPU memory: 324.9375 MB.

timing: 1.042516
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.
2048 steps (per iteration)
================================================================================
max_len: 2048
iter_idx: 0

Used GPU memory: 120.9375 MB.

Used GPU memory: 174.9375 MB.

timing: 9.206949
max_memory_allocated increased: 8.2744140625 MB.
Used GPU memory increased: 54.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 1

Used GPU memory: 174.9375 MB.

Used GPU memory: 476.9375 MB.

timing: 232.851245
max_memory_allocated increased: 2.11279296875 MB.
Used GPU memory increased: 302.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 2

Used GPU memory: 476.9375 MB.

Used GPU memory: 476.9375 MB.

timing: 2.026303
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 3

Used GPU memory: 476.9375 MB.

Used GPU memory: 476.9375 MB.

timing: 1.983656
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.

outputs (with information from intermediate steps)

2048 steps
================================================================================
max_len: 2048
iter_idx: 0

Used GPU memory: 120.9375 MB.

step: 0001 | `max_memory_allocated` increased (per step): 8.2744 MB | Used GPU increased (since this iter.): 54.000 MB
step: 0002 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 0256 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 0512 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 0768 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 1024 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 1280 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 1536 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 1792 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 2047 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 2048 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB

Used GPU memory: 174.9375 MB.

timing: 9.206949
max_memory_allocated increased: 8.2744140625 MB.
Used GPU memory increased: 54.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 1

Used GPU memory: 174.9375 MB.

step: 0001 | `max_memory_allocated` increased (per step): 0.1265 MB | Used GPU increased (since this iter.): 0.000 MB
step: 0002 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 0.000 MB
step: 0256 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 36.000 MB
step: 0512 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 74.000 MB
step: 0768 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 110.000 MB
step: 1024 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 150.000 MB
step: 1280 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 186.000 MB
step: 1536 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 224.000 MB
step: 1792 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 262.000 MB
step: 2047 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 300.000 MB
step: 2048 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 302.000 MB

Used GPU memory: 476.9375 MB.

timing: 232.851245
max_memory_allocated increased: 2.11279296875 MB.
Used GPU memory increased: 302.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 2

Used GPU memory: 476.9375 MB.

Used GPU memory: 476.9375 MB.

timing: 2.026303
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 3

Used GPU memory: 476.9375 MB.

Used GPU memory: 476.9375 MB.

timing: 1.983656
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.

model.py

modeling code
import torch

class MyModel_2(torch.nn.Module):

    def __init__(self, max_len=8192, dim=1024):
        super().__init__()
        self.embedding = torch.nn.Embedding(num_embeddings=max_len, embedding_dim=dim)
        # Think of these as a fixed-size static KV cache
        self.cached_keys = torch.nn.Parameter(torch.ones(size=(1, max_len, dim), dtype=torch.float32))
        self.cached_values = torch.nn.Parameter(torch.ones(size=(1, max_len, dim), dtype=torch.float32))

        self.max_len = max_len

    def forward(self, input_ids, attn_mask):

        q_len = input_ids.size()[1]
        # Create a mask with the target length being `self.max_len`
        _mask = torch.zeros(size=(1, q_len, self.max_len), dtype=torch.int32, device=input_ids.device).to(torch.bool)
        # Update `_mask` with the argument `attn_mask`
        _mask[:, :, :attn_mask.size()[2]] = attn_mask

        hidden = self.embedding(input_ids)
        attn_output = torch.nn.functional.scaled_dot_product_attention(
            query=hidden,
            key=self.cached_keys,
            value=self.cached_values,
            attn_mask=_mask,
        )
        return attn_output

run.py

A script to run.
import datetime
import multiprocessing


def run(model_type, max_len, n_iter=4, warmup_run=False, log_steps=64, detailed=False):

    import torch
    if model_type == "MyModel_1":
        from model import MyModel_1 as MyModel
    elif model_type == "MyModel_2":
        from model import MyModel_2 as MyModel

    device = "cuda"
    model = MyModel(max_len=max_len, dim=16).to(device)
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

    with torch.no_grad():
        for iter_idx in range(n_iter):

            torch.cuda.empty_cache()

            print(f"max_len: {max_len}") if not warmup_run else 0
            print(f"iter_idx: {iter_idx}") if not warmup_run else 0

            steps = range(max_len)
            if warmup_run:
                steps = [0, 1, max_len-2, max_len - 1]

            torch.cuda.synchronize()
            s = datetime.datetime.now()

            for idx in steps:

                if model_type == "MyModel_1":
                    input_ids = torch.arange(idx+1, dtype=torch.int32, device=device).unsqueeze(0)
                elif model_type == "MyModel_2":
                    if idx == 0:
                        input_ids = torch.arange(3, dtype=torch.int32, device=device).unsqueeze(0)
                    else:
                        input_ids = torch.tensor([idx], dtype=torch.int32, device=device).unsqueeze(0)

                q_len = input_ids.size()[1]
                attn_mask = torch.ones(size=(q_len, idx + 1), dtype=torch.int32, device=device).unsqueeze(0).to(torch.bool)

                if idx == 0:
                    torch.cuda.empty_cache()
                    memory = torch.cuda.mem_get_info()
                    used_mem_start = (memory[1] - memory[0]) / 1024 / 1024
                    print(f"\nUsed GPU memory: {used_mem_start} MB.") if not warmup_run else 0

                    m_start = torch.cuda.max_memory_allocated(device=device)

                if not detailed:
                    _ = model(input_ids, attn_mask)
                else:
                    m1 = torch.cuda.max_memory_allocated(device=device)
                    _ = model(input_ids, attn_mask)
                    memory = torch.cuda.mem_get_info()
                    used_mem = (memory[1] - memory[0]) / 1024 / 1024
                    m2 = torch.cuda.max_memory_allocated(device=device)
                    diff_mem = max(m2 - m1, 0) / 1024 / 1024

                    if detailed and not warmup_run and iter_idx in [0, 1]:
                        if idx in [0, 1, max_len-2, max_len-1] or (idx + 1) % log_steps == 0:
                            if idx == 0:
                                print("")
                            print(f"step: {str(idx + 1).zfill(4)} | `max_memory_allocated` increased (per step): {'%.4f' % round(diff_mem, 4)} MB | Used GPU increased (since this iter.): {'%.3f' % round(used_mem - used_mem_start, 3)} MB")

            torch.cuda.synchronize()
            e = datetime.datetime.now()
            m_end = torch.cuda.max_memory_allocated(device=device)
            memory = torch.cuda.mem_get_info()  # refresh so the delta covers the whole iteration
            used_mem_end = (memory[1] - memory[0]) / 1024 / 1024
            diff_mem = max(m_end - m_start, 0) / 1024 / 1024

            print(f"\nUsed GPU memory: {used_mem_end} MB.") if not warmup_run else 0

            print(f"\ntiming: {(e-s).total_seconds()}") if not warmup_run else 0
            print(f"max_memory_allocated increased: {diff_mem} MB.") if not warmup_run else 0
            print(f"Used GPU memory increased: {used_mem_end - used_mem_start} MB.") if not warmup_run else 0
            print("-" * 60) if not warmup_run and iter_idx < n_iter - 1 else 0


model_type = "MyModel_2"
log_steps = 256
detailed = True
for max_len in [2048]:
    for i in range(2):

        warmup_run = not i
        n_iter = 2 if warmup_run else 4
        print("=" * 80) if not warmup_run else 0

        p = multiprocessing.Process(target=run, args=(model_type, max_len, n_iter, warmup_run, log_steps, detailed))
        p.start()
        p.join()

Versions

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.20.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.0-30-cloud-amd64-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 550.54.15
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping: 0
CPU MHz: 2299.998
BogoMIPS: 4599.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 45 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.3.0
[pip3] mypy-extensions==1.0.0
[pip3] natten==0.15.1+torch220cu121
[pip3] numpy==1.24.3
[pip3] onnx==1.16.1
[pip3] onnxconverter-common==1.13.0
[pip3] onnxruntime==1.18.0
[pip3] onnxruntime-tools==1.7.0
[pip3] tf2onnx==1.16.1
[pip3] torch==2.3.0+cu121
[pip3] torchaudio==2.3.0+cu121
[pip3] torchvision==0.18.0+cu121
[pip3] triton==2.3.0
[conda] Could not collect

cc @mcarilli @ezyang @eellison @peterbell10 @bdhirsh @anijain2305 @chauhang

@ezyang
Contributor

ezyang commented Jun 12, 2024

This is probably the same as #119640

reduce-overhead is not magic fairy dust. It works by using CUDA graphs, and CUDA graphs do not work with dynamic shapes, so we record a separate CUDA graph for each dynamic shape seen. This can end up using a lot of memory. To reduce memory usage, you will need to do some padding at multiples. Or you can rearchitect your prefill/decode so that it is CUDA graph friendly, as was done in gpt-fast.
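
A minimal sketch of the padding-at-multiples idea applied to the toy model above (editor's illustration, not taken from gpt-fast; the pad_to_multiple helper and the bucket size of 256 are made up for the example). Because the model writes attn_mask into a zero-initialized full-length mask, padding with False columns does not change the attention result, but it caps the number of distinct shapes (and hence CUDA graphs) at max_len / 256.

import torch

def pad_to_multiple(attn_mask, multiple=256):
    # Round the mask's last dimension up to the next multiple so that
    # torch.compile(mode="reduce-overhead") only ever sees a handful of shapes.
    cur = attn_mask.size(-1)
    target = ((cur + multiple - 1) // multiple) * multiple
    if target == cur:
        return attn_mask
    pad = torch.zeros(*attn_mask.shape[:-1], target - cur,
                      dtype=attn_mask.dtype, device=attn_mask.device)
    return torch.cat([attn_mask, pad], dim=-1)

# usage inside the decode loop, before calling the compiled forward:
# attn_mask = pad_to_multiple(attn_mask, multiple=256)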

@vadimkantorov
Contributor

vadimkantorov commented Jun 12, 2024

@ezyang Besides my support for more love for padding multiples in nestedtensor constructors (e.g. #65156) and more in-place/out= variants (e.g. for torch.cat), it would also be cool to have more introspection into the Inductor compiler cache / CUDA graph cache. For example, if there were a way to list all cached shape/dtype specializations from Python, it would be easier to diagnose/confirm this sort of problem (the list would keep growing over time), plus perhaps a higher-level metric on memory fragmentation or more examples of memory-allocator stats. Also, could one enable coarser memory-allocator segment sizes without recompiling torch? (That could go along with fully-fledged support for customized/reconfigurable memory allocators.)
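
For what it is worth, some coarse allocator introspection already exists today (a sketch of what I believe is available; it does not list the per-shape specializations being asked for here):

import os
# Allocator behaviour can be tuned without rebuilding PyTorch via this env var
# (set it before CUDA is initialized); the value here is only an example.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

# Coarse allocator statistics (reserved vs. allocated bytes, segment counts, ...)
stats = torch.cuda.memory_stats()
print(stats["reserved_bytes.all.current"], stats["allocated_bytes.all.current"])

# Human-readable summary, handy for spotting fragmentation
print(torch.cuda.memory_summary(abbreviated=True))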

@ydshieh
Author

ydshieh commented Jun 12, 2024

Thanks @ezyang! OK, I guess we have to use the workarounds you mentioned, such as padding.
(Rearchitecting our prefill/decode so that it is CUDA-graph friendly is harder to handle, as I work on the HuggingFace transformers team and we try to keep the API stable.)

I agree with what @vadimkantorov mentioned about a way to investigate this cache. (Perhaps it's already possible with TORCH_LOGS?)
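
For reference, TORCH_LOGS can already surface part of this (a sketch of the logging knobs I believe are relevant here; it reports one recompilation per new shape rather than listing the CUDA graph cache itself):

# from the shell:
#   TORCH_LOGS="recompiles,graph_breaks" python run.py
# or equivalently, from Python before running the compiled model:
import torch._logging
torch._logging.set_logs(recompiles=True, graph_breaks=True)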

@eellison
Contributor

One other thing: CUDA has a driver-level issue where CUDA graphs take a lot of device memory (64 KB per kernel). That is fixed on CUDA 12.4 and driver 550+.
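
A quick way to check whether a given machine is on an affected stack (a sketch; the environment report above already shows CUDA 12.1 with driver 550.54.15):

import subprocess
import torch

# CUDA toolkit version PyTorch was built against ("12.1" in the report above)
print(torch.version.cuda)

# Installed NVIDIA driver version (requires nvidia-smi on PATH)
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
).decode().strip())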

@eellison eellison added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module module: cuda graphs Ability to capture and then replay streams of CUDA kernels labels Jun 12, 2024
@ydshieh
Author

ydshieh commented Jun 13, 2024

@ezyang

Or you can rearchitect your prefill/decode so that it is CUDA graph friendly, as was done in gpt-fast.

Confirmed that keeping all tensors (not just the arguments of the top-level forward) to a fixed (small) number of shapes avoids the issue.
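
A minimal sketch of that fix for the toy model above (editor's illustration): allocate the full-width mask once outside the compiled forward and update it in place, so every decode step sees exactly the same input shapes.

import torch
from model import MyModel_2

max_len = 2048
model = MyModel_2(max_len=max_len, dim=16).to("cuda")
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

with torch.no_grad():
    # One fixed-shape mask; only its contents change from step to step.
    full_mask = torch.zeros(1, 1, max_len, dtype=torch.bool, device="cuda")
    for step in range(max_len):
        input_ids = torch.tensor([[step]], dtype=torch.int32, device="cuda")
        full_mask[:, :, step] = True      # reveal one more position, in place
        _ = model(input_ids, full_mask)   # input shapes are identical every step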

Feel free to close this issue if you think we should.

@leng-yue
Contributor

Is there any way we can save the cache if the input size is exactly the same? Recompiling (even with a cache) on the first run is very slow (40s+) for GPT-fast.
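
Not a full answer, but on recent PyTorch builds the Inductor FX graph cache can shave part of that cold start off (a sketch; whether the flag exists and what it defaults to depends on the PyTorch version, and the CUDA graph capture for reduce-overhead still has to be redone in every new process):

import os
import torch._inductor.config as inductor_config

# Persist compiled artifacts across processes (the path is just an example)
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/tmp/inductor_cache")
# Enable the FX graph cache if this build exposes it
if hasattr(inductor_config, "fx_graph_cache"):
    inductor_config.fx_graph_cache = True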

@ydshieh
Author

ydshieh commented Jun 20, 2024

The problem described in this issue is not related to recompiling, and the first run (iteration) is reasonably fast.

I have personally tried GPT-fast (for another experiment, not related to this issue) and it works quite well for me. That said, 40 seconds sounds quite reasonable to me.
