
CUDA OOM and possible solution -- diffusers cli_demo.py with Nvidia 3090 24GB #92

Closed
1 of 2 tasks
cktlco opened this issue Aug 7, 2024 · 27 comments

@cktlco

cktlco commented Aug 7, 2024

System Info / 系統信息

Thanks very much for releasing this great work!

In case this helps anyone else:

The diffusers cli_demo.py raised the CUDA OOM error below on an RTX 3090 with 24GB of VRAM using this command:

python cli_demo.py --prompt "A fish swimming underwater through a colorful coral reef. Sun is shining brightly through the water. It is a beautiful scene suitable for use in an eye-catching television advertisement." --model_path THUDM/CogVideoX-2b --num_inference_steps 50

... but works with barely enough free VRAM with this small adjustment -- set the env var PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True prior to running:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python cli_demo.py --prompt "A fish swimming underwater through a colorful coral reef. Sun is shining brightly through the water. It is a beautiful scene suitable for use in an eye-catching television advertisement." --model_path THUDM/CogVideoX-2b --num_inference_steps 50
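(If you prefer not to set the variable in the shell, the same thing can be done from inside the script -- a minimal sketch; the setting has to be in place before torch initializes the CUDA allocator, so setting it before importing torch is the safe option.)

import os

# Must be set before torch initializes its CUDA caching allocator,
# so set it before importing torch.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # imported after setting the env var on purpose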

CUDA OOM error:

    return torch._C._nn.pad(input, pad, mode, value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.66 GiB. GPU 0 has a total capacity of 23.48 GiB of which 1.53 GiB is free. Including non-PyTorch memory, this process has 21.81 GiB memory in use. Of the allocated memory 18.92 GiB is allocated by PyTorch, and 2.58 GiB is reserved by PyTorch but unallocated.

nvidia-smi before running -- only 135MiB VRAM used:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:09:00.0 Off |                  N/A |
|  0%   32C    P8             17W /  350W |     135MiB /  24576MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

nvidia-smi while running with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True -- 23GB used but no CUDA OOM.

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090        Off |   00000000:09:00.0 Off |                  N/A |
|  0%   40C    P2            157W /  350W |   23335MiB /  24576MiB |      4%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

This was using a fresh conda environment with dependencies installed by pip from requirements.txt (side note: please include imageio in requirements.txt, and open the version range for opencv-python to >=4.10, since 4.10 was yanked upstream).
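(For illustration only, the suggested requirements.txt change might look like the excerpt below; the actual file and pins in the repo may differ.)

# requirements.txt (illustrative excerpt, not the repo's actual contents)
imageio                # needed for video export but currently missing
opencv-python>=4.10    # open the version range, since 4.10 was yanked upstream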

Information / 问题信息

  • The official example scripts / 官方的示例脚本
  • My own modified scripts / 我自己修改的脚本和任务

Reproduction / 复现过程

Details above.

Expected behavior / 期待表现

Details above.

@cktlco
Author

cktlco commented Aug 7, 2024

The generated video is great, by the way -- keep up the good work!

reef.mp4

@tengjiayan20
Contributor

So, have you solved your OOM problem? When we tested the demo, it actually needed 23.9GB. That is a bit extreme, which can cause an occasional OOM.

@cktlco
Author

cktlco commented Aug 7, 2024

Yes, setting this environment variable before running solved it for me:

PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

and I see it using a little over 20GB with that setting with nvidia-smi:

20423MiB /  24576MiB

@tengjiayan20 tengjiayan20 self-assigned this Aug 7, 2024
@TNT3530

TNT3530 commented Aug 8, 2024

I am getting OOM using 4 32GB GPUs. Using device_map="balanced" seems to split the model across 3 of the cards before throwing the OOM error.
image

@zRzRzRzRzRzRzR
Member

Did you use pipe.enable_model_cpu_offload()? If not, it will use 36GB and cause this problem.
If you want to use multiple GPUs, just remove pipe.enable_model_cpu_offload().
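For reference, the two setups look roughly like this (just a sketch based on this thread; whether device_map="balanced" fits your particular cards is not guaranteed):

import torch
from diffusers import CogVideoXPipeline

# Single GPU: keep CPU offload enabled so components are moved off the GPU between uses.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()

# Multiple GPUs: let accelerate spread the components across the cards instead,
# and do NOT combine this with enable_model_cpu_offload().
# pipe = CogVideoXPipeline.from_pretrained(
#     "THUDM/CogVideoX-2b", torch_dtype=torch.float16, device_map="balanced"
# )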

@TNT3530

TNT3530 commented Aug 8, 2024

Did you use pipe.enable_model_cpu_offload()? If not, it will use 36GB and cause this problem. If you want to use multiple GPUs, just remove pipe.enable_model_cpu_offload().

image
this OOMs

image
as does this

Both attempts try to allocate ~36GB.
This also happens when swapping to sat/inference.sh with the sample .txt.

@zRzRzRzRzRzRzR
Member

Try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True.
And what are your NVIDIA driver and GPU? A V100, I guess. It should work (although we only tested on the 3090 and A100).

@TNT3530

TNT3530 commented Aug 8, 2024

Try PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. And what are your NVIDIA driver and GPU? A V100, I guess. It should work (although we only tested on the 3090 and A100).

It still OOMs with that set, for both multi- and single-GPU attempts.
The GPUs are AMD Instinct MI100s on ROCm 6.0.
I do notice "Torch was not compiled with memory efficient attention..." in the logs, so I'm guessing it may just be an issue with the ROCm variant of torch :(

@zRzRzRzRzRzRzR
Member

Which torch version can you use? 2.2, 2.3, and 2.4 all do not work, right?

@TNT3530

TNT3530 commented Aug 8, 2024

Which torch version can you use? 2.2, 2.3, and 2.4 all do not work, right?

Torch 2.2.2, 2.3.1, and 2.4.0 all fail with the same attempted memory usage of ~36GB

@a-r-r-o-w

a-r-r-o-w commented Aug 8, 2024

Just an FYI: If you install accelerate from the branch in the following PR, the Diffusers demo runs in ~18 GB. Context: huggingface/accelerate#2994 (comment)

Code
import gc

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video


def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"


flush()

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)

pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]

torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
max_memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
max_reserved = bytes_to_giga_bytes(torch.cuda.max_memory_reserved())
print(f"{memory=}")
print(f"{max_memory=}")
print(f"{max_reserved=}")

export_to_video(video, "output.mp4", fps=8)

The PR will hopefully be merged into accelerate main soon. If, for some reason, you cannot or do not want to use accelerate from the dev branch, you could do the following:

Code without accelerate dev branch requirement
import gc

import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video


def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()


def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"


flush()

prompt = (
    "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
    "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
    "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
    "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
    "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
    "atmosphere of this unique musical performance."
)

pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16).to("cuda")
latents = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50, output_type="latent", return_dict=False)[0]

pipe.transformer.to("cpu")
pipe.text_encoder.to("cpu")
torch.cuda.synchronize()
torch.cuda.empty_cache()

with torch.no_grad():
    video = pipe.decode_latents(latents, num_seconds=6)
    video = pipe.video_processor.postprocess_video(video=video, output_type="pil")

torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
max_memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
max_reserved = bytes_to_giga_bytes(torch.cuda.max_memory_reserved())
print(f"{memory=}")
print(f"{max_memory=}")
print(f"{max_reserved=}")

export_to_video(video, "output.mp4", fps=8)

Additionally, you can play around with the device_map parameter if you have multiple GPUs, or quantize the text encoder or the full transformer. Denoising only requires about 12-14 GB of memory (if using CPU offloading), but it's the VAE that takes the most memory (1 GB model + 17 GB for decoding). We are working on figuring out tiled decoding, but nothing promising yet. I would imagine that CogVideoX would be runnable on a free-tier T4 or lower if someone can figure out tiling, so if anyone's got ideas, feel free to PR at Diffusers.
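As a rough illustration of the "quantize the text encoder" idea (not a verified recipe -- bitsandbytes, the exact savings, and how an 8-bit component interacts with CPU offload are all assumptions here):

import torch
from diffusers import CogVideoXPipeline
from transformers import BitsAndBytesConfig, T5EncoderModel

# Sketch: load only the T5 text encoder in 8-bit via bitsandbytes; the rest stays fp16.
text_encoder = T5EncoderModel.from_pretrained(
    "THUDM/CogVideoX-2b",
    subfolder="text_encoder",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
)

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",
    text_encoder=text_encoder,
    torch_dtype=torch.float16,
)

# 8-bit modules cannot be moved between devices, so place the remaining fp16
# components manually rather than calling pipe.to("cuda") on the whole pipeline.
pipe.transformer.to("cuda")
pipe.vae.to("cuda")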

@TNT3530

TNT3530 commented Aug 8, 2024

Just an FYI: If you install accelerate from the branch in the following PR, the Diffusers demo runs in ~18 GB. Context: huggingface/accelerate#2994 (comment)
Code

The PR will hopefully be merged into accelerate main soon. If, for some reason, you cannot or do not want to use accelerate from the dev branch, you could do the following:
Code without accelerate dev branch requirement

Additionally, you can play around with the device_map parameter if you have multiple GPUs, or quantize the text encoder or the full transformer. Denoising only requires about 12-14 GB of memory (if using CPU offloading), but it's the VAE that takes the most memory (1 GB model + 17 GB for decoding). We are working on figuring out tiled decoding, but nothing promising yet. I would imagine that CogVideoX would be runnable on a free-tier T4 or lower if someone can figure out tiling, so if anyone's got ideas, feel free to PR at Diffusers.

  • Running your provided non-dev-branch code and only swapping out the model for the HF THUDM/CogVideoX-2b still OOMs with the same numbers.
  • Adding PYTORCH_NO_MEMORY_CACHING=1 and pipe.enable_model_cpu_offload() also OOMs
  • Installing from source after cloning huggingface/accelerate and checking out test-clear-memory-cpu-offload, running pip install . and executing, also OOMs
  • Doing all of the above combined but adding device_map="balanced" also OOMs

@a-r-r-o-w

I am perfectly able to run the 2nd example above on an A4500 (20GB) with a fresh PyTorch 2.3 install and diffusers:main. We'll have to try and debug what's going wrong in your setup.

You mention:

I do notice "Torch was not compiled with memory efficient attention..." in the logs, so I'm guessing it may just be an issue with the ROCm variant of torch :(

Can you paste the error stack trace here? I would like to know at what point it's failing. If it's failing somewhere in attention, it's probably because you're unable to use FA2, which is necessary to be able to run with low memory. Can you try setting up PyTorch so that it allows you to run FA2?
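For reference, a quick way to check which SDPA backends your torch build allows (a small diagnostic sketch, nothing CogVideoX-specific). If only the math fallback is available, scaled_dot_product_attention materializes the full attention matrix, which is exactly the kind of tens-of-GB allocation being reported.

import torch

# Report which scaled_dot_product_attention backends this torch build allows.
print("flash SDP enabled:        ", torch.backends.cuda.flash_sdp_enabled())
print("mem-efficient SDP enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math SDP enabled:         ", torch.backends.cuda.math_sdp_enabled())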

@TNT3530

TNT3530 commented Aug 9, 2024

I am perfectly able to run the 2nd example above on an A4500 (20GB) with a fresh PyTorch 2.3 install and diffusers:main. We'll have to try and debug what's going wrong in your setup.

You mention:

I do notice "Torch was not compiled with memory efficient attention..." in the logs, so I'm guessing it may just be an issue with the ROCm variant of torch :(

Can you paste the error stack trace here? I would like to know at what point it's failing. If it's failing somewhere in attention, it's probably because you're unable to use FA2, which is necessary to be able to run with low memory. Can you try setting up PyTorch so that it allows you to run FA2?

I installed flash_attn 2.0.4 built from source, but I'm not sure how to force torch to let me use it.

Output

/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/cuda/memory.py:343: FutureWarning: torch.cuda.reset_max_memory_allocated now calls torch.cuda.reset_peak_memory_stats, which resets /all/ peak memory stats.
  warnings.warn(
The config attributes {'mid_block_add_attention': True} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.

Loading pipeline components...:   0%|          | 0/5 [00:00<?, ?it/s]The config attributes {'mid_block_add_attention': True} were passed to AutoencoderKLCogVideoX, but are not expected and will be ignored. Please verify your config.json configuration file.

Loading pipeline components...:  20%|██        | 1/5 [00:00<00:03,  1.03it/s]
Loading pipeline components...:  60%|██████    | 3/5 [00:01<00:00,  3.24it/s]
Loading pipeline components...:  80%|████████  | 4/5 [00:06<00:01,  1.98s/it]

Loading checkpoint shards:   0%|          | 0/2 [00:00<?, ?it/s]

Loading checkpoint shards:  50%|█████     | 1/2 [00:05<00:05,  5.55s/it]

Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00,  5.17s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [00:10<00:00,  5.23s/it]

Loading pipeline components...: 100%|██████████| 5/5 [00:16<00:00,  4.82s/it]
Loading pipeline components...: 100%|██████████| 5/5 [00:16<00:00,  3.36s/it]

  0%|          | 0/50 [00:00<?, ?it/s]
  0%|          | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/tnt3530/Documents/CogVideo/provided.py", line 28, in <module>
    latents = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50, output_type="latent", return_dict=False)[0]
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox.py", line 629, in __call__
    noise_pred = self.transformer(
                 ^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 326, in forward
    hidden_states, encoder_hidden_states = block(
                                           ^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/transformers/cogvideox_transformer_3d.py", line 123, in forward
    attn_output = self.attn1(
                  ^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/attention_processor.py", line 490, in forward
    return self.processor(
           ^^^^^^^^^^^^^^^
  File "/home/tnt3530/anaconda3/envs/cog/lib/python3.12/site-packages/diffusers/models/attention_processor.py", line 2216, in __call__
    hidden_states = F.scaled_dot_product_attention(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 35.31 GiB. GPU 1 has a total capacity of 31.98 GiB of which 24.15 GiB is free. Of the allocated memory 4.58 GiB is allocated by PyTorch, and 481.62 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

EDIT: SDPA is disabled on AMD cards because they decided that only cool kids can play with fun toys
pytorch/pytorch#112997

@TNT3530

TNT3530 commented Aug 9, 2024

Updating to torch 2.5.0 nightly makes it attempt to allocate 70.63GB (~35GB per GPU now). Forcefully disabling SDPA via torch.backends.cuda.enable_flash_sdp(False) doesn't help either.

@a-r-r-o-w

Yeah, it's quite unfortunate that AMD doesn't support SDPA :( From the error logs, it seems like that's the only bottleneck preventing you from running Cog. Let me know if you find any issues on the Diffusers side of things that I can help with.

@zhanghang1995

Thank you for your great work. I have a question: can it be inferenced well across multiple GPUs (RTX 3060 Ti), or must one or more GPUs have >= 20GB? Thank you
image

@zRzRzRzRzRzRzR
Member

One or more GPUs must have >= 20GB because of the VAE.

@zhanghang1995

One or more GPUs must have >= 20GB because of the VAE.

Thank you. Is there any method to split the VAE module?

@zRzRzRzRzRzRzR
Member

Not right now; we will try a tiled VAE. We tested the balanced device_map on 3 GPUs with 20GB each; you can try whether it runs on 3 * 16GB GPUs.

@zhanghang1995

Not right now; we will try a tiled VAE. We tested the balanced device_map on 3 GPUs with 20GB each; you can try whether it runs on 3 * 16GB GPUs.
Thank you. If there is any progress on the tiled VAE, please let us know.

@zRzRzRzRzRzRzR
Member

zRzRzRzRzRzRzR commented Aug 14, 2024

Not right now; we will try a tiled VAE. We tested the balanced device_map on 3 GPUs with 20GB each; you can try whether it runs on 3 * 16GB GPUs.
Thank you. If there is any progress on the tiled VAE, please let us know.

Try installing the diffusers and accelerate libs from source and check the CLI demo in the CogVideoX-dev branch now; it will only need 12GB for inference.
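For reference, installing both libraries from source is typically just:

pip install git+https://github.com/huggingface/diffusers.git
pip install git+https://github.com/huggingface/accelerate.git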

@QAQEthan

Did you use pipe.enable_model_cpu_offload()? If not, it will use 36GB and cause this problem. If you want to use multiple GPUs, just remove pipe.enable_model_cpu_offload().

Hi, why do I still need ~36GB of GPU memory even though I set pipe.enable_model_cpu_offload()?

@zRzRzRzRzRzRzR
Member

Did you follow the cli_demo.py code, and are you using an NVIDIA Ampere or newer GPU like a 3090 or 4090?

@QAQEthan

Did you follow the cli_demo.py code, and are you using an NVIDIA Ampere or newer GPU like a 3090 or 4090?

I run on an NVIDIA A100, and the demo code is from Hugging Face: https://huggingface.co/THUDM/CogVideoX-2b

@zRzRzRzRzRzRzR
Member

You can try reinstalling the diffusers and accelerate libs from source; an A100 should work when using inference/cli_demo.py in this GitHub repo.

@zRzRzRzRzRzRzR
Member

We have updated the repository, and the dependencies can now be installed from pip. Updating the dependencies and retrying cli_demo should solve the problem. If there are any new issues, please open a new one.
