
torch.compile with reduce-overhead: very long compile time + GPU memory continuously growing #128424

Open
ydshieh opened this issue Jun 11, 2024 · 7 comments
Labels
module: cuda graphs Ability to capture and then replay streams of CUDA kernels module: dynamic shapes oncall: pt2 triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@ydshieh

ydshieh commented Jun 11, 2024

🐛 Describe the bug

For the code snippet, see the end of this report:

  • model.py: very simple
  • run.py: a bit more complex, but only in order to measure memory and timing

In short: torch.compile with reduce-overhead takes a long time and GPU memory usage keeps growing (during the second call for each shape seen).

Although the code snippet is a dummy, the same situation occurs when I test with Llama or Gemma models from HuggingFace.

This can be seen clearly in the output sections below. A few notes to help read the code snippet and the outputs:

  • Think of each (outer) iteration as one (language-model) call to generate (running up to max_len steps); a minimal sketch of this pattern follows the question below
  • Think of each (inner) iteration as one (language-model) decoding step (which calls the model's forward)
  • The 1st (outer) iteration (i.e. the first call to generate) sees all possible input shapes: time taken and memory usage don't vary much
    • for max_len=1024
      • timing: 7.76404
      • Used GPU memory increased: 54.0 MB.
    • for max_len=2048
      • timing: 9.206949
      • Used GPU memory increased: 54.0 MB.
    • for max_len=4096
      • timing: 12.44839
      • Used GPU memory increased: 54.0 MB.
  • The 2nd (outer) iteration: takes much longer and memory usage accumulates (see the outputs with information from intermediate steps below)
    • for max_len=1024
      • timing: 115.606751
      • Used GPU memory increased: 150.0 MB.
    • for max_len=2048
      • timing: 232.851245
      • Used GPU memory increased: 302.0 MB.
    • for max_len=4096
      • timing: 565.084438
      • Used GPU memory increased: 606.0 MB.
  • The 3rd (outer) iteration: very fast and no memory issue
    • for max_len=1024
      • timing: 1.045248
      • Used GPU memory increased: 0.0 MB.
    • for max_len=2048
      • timing: 2.026303
      • Used GPU memory increased: 0.0 MB.
    • for max_len=4096
      • timing: 3.927233
      • Used GPU memory increased: 0.0 MB.

Question: the slowness and memory accumulation in the 2nd iteration (especially with such a tiny model here) make torch.compile impractical for this use case, which seems to be a common one.
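
To make the terminology concrete, here is a minimal sketch of the generate/decode pattern (editor's illustration, not the measurement script; run.py below is what produced the numbers above). The attention mask's last dimension grows with the step index, so every decode step presents a new shape to the compiled forward.

import torch
from model import MyModel_2

model = MyModel_2(max_len=2048, dim=16).to("cuda")
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

with torch.no_grad():
    for call_idx in range(3):        # "outer" iteration: one generate() call
        for step in range(2048):     # "inner" iteration: one decoding step
            input_ids = torch.tensor([[step]], dtype=torch.int32, device="cuda")
            # the mask width (step + 1) changes every step -> a new dynamic shape
            attn_mask = torch.ones(1, 1, step + 1, dtype=torch.bool, device="cuda")
            _ = model(input_ids, attn_mask)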

outputs (brief)

1024 steps (per iteration)
================================================================================
max_len: 1024
iter_idx: 0

Used GPU memory: 120.9375 MB.

Used GPU memory: 174.9375 MB.

timing: 7.76404
max_memory_allocated increased: 8.2001953125 MB.
Used GPU memory increased: 54.0 MB.
------------------------------------------------------------
max_len: 1024
iter_idx: 1

Used GPU memory: 174.9375 MB.

Used GPU memory: 324.9375 MB.

timing: 115.606751
max_memory_allocated increased: 1.05615234375 MB.
Used GPU memory increased: 150.0 MB.
------------------------------------------------------------
max_len: 1024
iter_idx: 2

Used GPU memory: 324.9375 MB.

Used GPU memory: 324.9375 MB.

timing: 1.045248
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.
------------------------------------------------------------
max_len: 1024
iter_idx: 3

Used GPU memory: 324.9375 MB.

Used GPU memory: 324.9375 MB.

timing: 1.042516
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.
2048 steps (per iteration)
================================================================================
max_len: 2048
iter_idx: 0

Used GPU memory: 120.9375 MB.

Used GPU memory: 174.9375 MB.

timing: 9.206949
max_memory_allocated increased: 8.2744140625 MB.
Used GPU memory increased: 54.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 1

Used GPU memory: 174.9375 MB.

Used GPU memory: 476.9375 MB.

timing: 232.851245
max_memory_allocated increased: 2.11279296875 MB.
Used GPU memory increased: 302.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 2

Used GPU memory: 476.9375 MB.

Used GPU memory: 476.9375 MB.

timing: 2.026303
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 3

Used GPU memory: 476.9375 MB.

Used GPU memory: 476.9375 MB.

timing: 1.983656
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.

outputs (with information from intermediate steps)

2048 steps
================================================================================
max_len: 2048
iter_idx: 0

Used GPU memory: 120.9375 MB.

step: 0001 | `max_memory_allocated` increased (per step): 8.2744 MB | Used GPU increased (since this iter.): 54.000 MB
step: 0002 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 0256 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 0512 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 0768 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 1024 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 1280 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 1536 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 1792 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 2047 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB
step: 2048 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 54.000 MB

Used GPU memory: 174.9375 MB.

timing: 9.206949
max_memory_allocated increased: 8.2744140625 MB.
Used GPU memory increased: 54.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 1

Used GPU memory: 174.9375 MB.

step: 0001 | `max_memory_allocated` increased (per step): 0.1265 MB | Used GPU increased (since this iter.): 0.000 MB
step: 0002 | `max_memory_allocated` increased (per step): 0.0000 MB | Used GPU increased (since this iter.): 0.000 MB
step: 0256 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 36.000 MB
step: 0512 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 74.000 MB
step: 0768 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 110.000 MB
step: 1024 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 150.000 MB
step: 1280 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 186.000 MB
step: 1536 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 224.000 MB
step: 1792 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 262.000 MB
step: 2047 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 300.000 MB
step: 2048 | `max_memory_allocated` increased (per step): 0.0010 MB | Used GPU increased (since this iter.): 302.000 MB

Used GPU memory: 476.9375 MB.

timing: 232.851245
max_memory_allocated increased: 2.11279296875 MB.
Used GPU memory increased: 302.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 2

Used GPU memory: 476.9375 MB.

Used GPU memory: 476.9375 MB.

timing: 2.026303
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.
------------------------------------------------------------
max_len: 2048
iter_idx: 3

Used GPU memory: 476.9375 MB.

Used GPU memory: 476.9375 MB.

timing: 1.983656
max_memory_allocated increased: 0.0 MB.
Used GPU memory increased: 0.0 MB.

model.py

modeling code
import torch

class MyModel_2(torch.nn.Module):

    def __init__(self, max_len=8192, dim=1024):
        super().__init__()
        self.embedding = torch.nn.Embedding(num_embeddings=max_len, embedding_dim=dim)
        # Think of these as a fixed-size static KV cache
        self.cached_keys = torch.nn.Parameter(torch.ones(size=(1, max_len, dim), dtype=torch.float32))
        self.cached_values = torch.nn.Parameter(torch.ones(size=(1, max_len, dim), dtype=torch.float32))

        self.max_len = max_len

    def forward(self, input_ids, attn_mask):

        q_len = input_ids.size()[1]
        # Create a mask with the target length being `self.max_len`
        _mask = torch.zeros(size=(1, q_len, self.max_len), dtype=torch.int32, device=input_ids.device).to(torch.bool)
        # Update `_mask` with the argument `attn_mask`
        _mask[:, :, :attn_mask.size()[2]] = attn_mask

        hidden = self.embedding(input_ids)
        attn_output = torch.nn.functional.scaled_dot_product_attention(
            query=hidden,
            key=self.cached_keys,
            value=self.cached_values,
            attn_mask=_mask,
        )
        return attn_output

run.py

A script to run.
import datetime
import multiprocessing


def run(model_type, max_len, n_iter=4, warmup_run=False, log_steps=64, detailed=False):

    import torch
    if model_type == "MyModel_1":
        from model import MyModel_1 as MyModel
    elif model_type == "MyModel_2":
        from model import MyModel_2 as MyModel

    device = "cuda"
    model = MyModel(max_len=max_len, dim=16).to(device)
    model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

    with torch.no_grad():
        for iter_idx in range(n_iter):

            torch.cuda.empty_cache()

            print(f"max_len: {max_len}") if not warmup_run else 0
            print(f"iter_idx: {iter_idx}") if not warmup_run else 0

            steps = range(max_len)
            if warmup_run:
                steps = [0, 1, max_len-2, max_len - 1]

            torch.cuda.synchronize()
            s = datetime.datetime.now()

            for idx in steps:

                if model_type == "MyModel_1":
                    input_ids = torch.arange(idx+1, dtype=torch.int32, device=device).unsqueeze(0)
                elif model_type == "MyModel_2":
                    if idx == 0:
                        input_ids = torch.arange(3, dtype=torch.int32, device=device).unsqueeze(0)
                    else:
                        input_ids = torch.tensor([idx], dtype=torch.int32, device=device).unsqueeze(0)

                q_len = input_ids.size()[1]
                attn_mask = torch.ones(size=(q_len, idx + 1), dtype=torch.int32, device=device).unsqueeze(0).to(torch.bool)

                if idx == 0:
                    torch.cuda.empty_cache()
                    memory = torch.cuda.mem_get_info()
                    used_mem_start = (memory[1] - memory[0]) / 1024 / 1024
                    print(f"\nUsed GPU memory: {used_mem_start} MB.") if not warmup_run else 0

                    m_start = torch.cuda.max_memory_allocated(device=device)

                if not detailed:
                    _ = model(input_ids, attn_mask)
                else:
                    m1 = torch.cuda.max_memory_allocated(device=device)
                    _ = model(input_ids, attn_mask)
                    memory = torch.cuda.mem_get_info()
                    used_mem = (memory[1] - memory[0]) / 1024 / 1024
                    m2 = torch.cuda.max_memory_allocated(device=device)
                    diff_mem = max(m2 - m1, 0) / 1024 / 1024

                    if detailed and not warmup_run and iter_idx in [0, 1]:
                        if idx in [0, 1, max_len-2, max_len-1] or (idx + 1) % log_steps == 0:
                            if idx == 0:
                                print("")
                            print(f"step: {str(idx + 1).zfill(4)} | `max_memory_allocated` increased (per step): {'%.4f' % round(diff_mem, 4)} MB | Used GPU increased (since this iter.): {'%.3f' % round(used_mem - used_mem_start, 3)} MB")

            torch.cuda.synchronize()
            e = datetime.datetime.now()
            m_end = torch.cuda.max_memory_allocated(device=device)
            memory = torch.cuda.mem_get_info()  # refresh so the delta covers the whole iteration
            used_mem_end = (memory[1] - memory[0]) / 1024 / 1024
            diff_mem = max(m_end - m_start, 0) / 1024 / 1024

            print(f"\nUsed GPU memory: {used_mem_end} MB.") if not warmup_run else 0

            print(f"\ntiming: {(e-s).total_seconds()}") if not warmup_run else 0
            print(f"max_memory_allocated increased: {diff_mem} MB.") if not warmup_run else 0
            print(f"Used GPU memory increased: {used_mem_end - used_mem_start} MB.") if not warmup_run else 0
            print("-" * 60) if not warmup_run and iter_idx < n_iter - 1 else 0


model_type = "MyModel_2"
log_steps = 256
detailed = True
for max_len in [2048]:
    for i in range(2):

        warmup_run = not i
        n_iter = 2 if warmup_run else 4
        print("=" * 80) if not warmup_run else 0

        p = multiprocessing.Process(target=run, args=(model_type, max_len, n_iter, warmup_run, log_steps, detailed))
        p.start()
        p.join()

Versions

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.20.3
Libc version: glibc-2.31

Python version: 3.8.10 (default, Nov 22 2023, 10:22:35) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.10.0-30-cloud-amd64-x86_64-with-glibc2.29
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla T4
Nvidia driver version: 550.54.15
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU @ 2.30GHz
Stepping: 0
CPU MHz: 2299.998
BogoMIPS: 4599.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 45 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT Host state unknown
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; IBRS, IBPB conditional, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single pti ssbd ibrs ibpb stibp fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear arch_capabilities

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.3.0
[pip3] mypy-extensions==1.0.0
[pip3] natten==0.15.1+torch220cu121
[pip3] numpy==1.24.3
[pip3] onnx==1.16.1
[pip3] onnxconverter-common==1.13.0
[pip3] onnxruntime==1.18.0
[pip3] onnxruntime-tools==1.7.0
[pip3] tf2onnx==1.16.1
[pip3] torch==2.3.0+cu121
[pip3] torchaudio==2.3.0+cu121
[pip3] torchvision==0.18.0+cu121
[pip3] triton==2.3.0
[conda] Could not collect

cc @mcarilli @ezyang @eellison @peterbell10 @bdhirsh @anijain2305 @chauhang

@ezyang
Contributor

ezyang commented Jun 12, 2024

This is probably the same as #119640

reduce-overhead is not magic fairy dust. It works by using CUDA graphs, and CUDA graphs do not work with dynamic shapes, so we record a separate CUDA graph for each dynamic shape seen. This can end up using a lot of memory. To reduce memory usage, you will need to do some padding at multiples. Or you can rearchitect your prefill/decode so that it is CUDA graph friendly, as was done in gpt-fast.
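
A minimal sketch of the padding-at-multiples idea applied to the toy model above (editor's illustration, not taken from gpt-fast; the pad_to_multiple helper and the bucket size of 256 are made up for the example). Because the model writes attn_mask into a zero-initialized full-length mask, padding with False columns does not change the attention result, but it caps the number of distinct shapes (and hence CUDA graphs) at max_len / 256.

import torch

def pad_to_multiple(attn_mask, multiple=256):
    # Round the mask's last dimension up to the next multiple so that
    # torch.compile(mode="reduce-overhead") only ever sees a handful of shapes.
    cur = attn_mask.size(-1)
    target = ((cur + multiple - 1) // multiple) * multiple
    if target == cur:
        return attn_mask
    pad = torch.zeros(*attn_mask.shape[:-1], target - cur,
                      dtype=attn_mask.dtype, device=attn_mask.device)
    return torch.cat([attn_mask, pad], dim=-1)

# usage inside the decode loop, before calling the compiled forward:
# attn_mask = pad_to_multiple(attn_mask, multiple=256)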

@vadimkantorov
Contributor

vadimkantorov commented Jun 12, 2024

@ezyang Besides my support for more love for padding multiples in nestedtensor constructors (e.g. #65156) and more in-place/out= variants (e.g. for torch.cat), it would also be cool to have more introspection into the Inductor compiler cache / CUDA graph cache. For example, if there were a way to list all cached shape/dtype specializations from Python, it would be easier to diagnose/confirm this sort of problem (the list would keep growing over time), plus perhaps a higher-level metric on memory fragmentation or more examples of memory-allocator stats. Also, could one enable coarser memory-allocator segment sizes without recompiling torch? (That could go along with fully-fledged support for customized/reconfigurable memory allocators.)
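
For what it is worth, some coarse allocator introspection already exists today (a sketch of what I believe is available; it does not list the per-shape specializations being asked for here):

import os
# Allocator behaviour can be tuned without rebuilding PyTorch via this env var
# (set it before CUDA is initialized); the value here is only an example.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "expandable_segments:True")

import torch

# Coarse allocator statistics (reserved vs. allocated bytes, segment counts, ...)
stats = torch.cuda.memory_stats()
print(stats["reserved_bytes.all.current"], stats["allocated_bytes.all.current"])

# Human-readable summary, handy for spotting fragmentation
print(torch.cuda.memory_summary(abbreviated=True))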

@ydshieh
Author

ydshieh commented Jun 12, 2024

Thanks @ezyang! OK, I guess we have to use the workarounds you mentioned, such as padding.
(Rearchitecting our prefill/decode so that it is CUDA-graph friendly is harder to handle, as I work on the HuggingFace transformers team and we try to keep the API stable.)

I agree with what @vadimkantorov mentioned about a way to investigate this cache. (Perhaps it's already possible with TORCH_LOGS?)
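
For reference, TORCH_LOGS can already surface part of this (a sketch of the logging knobs I believe are relevant here; it reports one recompilation per new shape rather than listing the CUDA graph cache itself):

# from the shell:
#   TORCH_LOGS="recompiles,graph_breaks" python run.py
# or equivalently, from Python before running the compiled model:
import torch._logging
torch._logging.set_logs(recompiles=True, graph_breaks=True)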

@eellison
Contributor

One other thing: CUDA has a driver-level issue where CUDA graphs take a lot of device memory (64 KB per kernel). That is fixed on CUDA 12.4 and driver 550+.
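
A quick way to check whether a given machine is on an affected stack (a sketch; the environment report above already shows CUDA 12.1 with driver 550.54.15):

import subprocess
import torch

# CUDA toolkit version PyTorch was built against ("12.1" in the report above)
print(torch.version.cuda)

# Installed NVIDIA driver version (requires nvidia-smi on PATH)
print(subprocess.check_output(
    ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]
).decode().strip())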

@eellison eellison added triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module module: cuda graphs Ability to capture and then replay streams of CUDA kernels labels Jun 12, 2024
@ydshieh
Author

ydshieh commented Jun 13, 2024

@ezyang

Or you can rearchitect your prefill/decode so that it is CUDA graph friendly, as was done in gpt-fast.

Confirmed that keeping all tensors (not just the arguments of the top-level forward) to a fixed (small) number of shapes avoids the issue.
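
A minimal sketch of that fix for the toy model above (editor's illustration): allocate the full-width mask once outside the compiled forward and update it in place, so every decode step sees exactly the same input shapes.

import torch
from model import MyModel_2

max_len = 2048
model = MyModel_2(max_len=max_len, dim=16).to("cuda")
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

with torch.no_grad():
    # One fixed-shape mask; only its contents change from step to step.
    full_mask = torch.zeros(1, 1, max_len, dtype=torch.bool, device="cuda")
    for step in range(max_len):
        input_ids = torch.tensor([[step]], dtype=torch.int32, device="cuda")
        full_mask[:, :, step] = True      # reveal one more position, in place
        _ = model(input_ids, full_mask)   # input shapes are identical every step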

Feel free to close this issue if you think we should.

@leng-yue
Contributor

Is there any way we can save the cache if the input size is exactly the same? Recompiling (even with a cache) on the first run is very slow (40s+) for GPT-fast.
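
Not a full answer, but on recent PyTorch builds the Inductor FX graph cache can shave part of that cold start off (a sketch; whether the flag exists and what it defaults to depends on the PyTorch version, and the CUDA graph capture for reduce-overhead still has to be redone in every new process):

import os
import torch._inductor.config as inductor_config

# Persist compiled artifacts across processes (the path is just an example)
os.environ.setdefault("TORCHINDUCTOR_CACHE_DIR", "/tmp/inductor_cache")
# Enable the FX graph cache if this build exposes it
if hasattr(inductor_config, "fx_graph_cache"):
    inductor_config.fx_graph_cache = True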

@ydshieh
Author

ydshieh commented Jun 20, 2024

The problem described in this issue is not related to recompiling, and the first run (iteration) is reasonably fast.

I have personally tried GPT-fast (for another experiment, not related to this issue) and it works quite well for me. That said, 40 seconds sounds quite reasonable to me.
