CUDA OOM and possible solution -- diffusers cli_demo.py with Nvidia 3090 24GB #92
Comments
The generated video is great, by the way -- keep up the good work! reef.mp4 |
So, have you solved your OOM problem? When we tested the demo, it needed 23.9GB in fact, which may be a bit tight and cause an occasional OOM. |
Yes, setting this environment variable before running solved it for me:
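(The variable, per the issue description further down, is PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True. A minimal sketch of setting it from Python rather than the shell, assuming it runs before anything touches CUDA:)
import os
# Must be in the environment before the first CUDA allocation; exporting it in
# the shell before launching cli_demo.py works just as well.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
import torch  # import torch only after the variable is set, to be safe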
and nvidia-smi shows it using a little over 20GB with that setting:
|
Did you use pipe.enable_model_cpu_offload()? If not, it will use 36GB and cause this problem. |
Both attempts still try to allocate ~36GB. |
Try this: PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
It also fails with that set, for both multi-GPU and single-GPU attempts. |
You tried torch 2.2, 2.3, and 2.4, and none of them work, right? |
Torch 2.2.2, 2.3.1, and 2.4.0 all fail with the same attempted memory usage of ~36GB |
Just an FYI: if you install accelerate from the branch in the following PR, the Diffusers demo runs in ~18 GB. Context: huggingface/accelerate#2994 (comment)
Code:
import gc
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()

def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"
flush()
prompt = (
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
"The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
"atmosphere of this unique musical performance."
)
pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16)
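# Offload submodules to the CPU and move each one onto the GPU only while it
# runs; this is what keeps peak VRAM low (requires accelerate)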
pipe.enable_model_cpu_offload()
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
max_memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
max_reserved = bytes_to_giga_bytes(torch.cuda.max_memory_reserved())
print(f"{memory=}")
print(f"{max_memory=}")
print(f"{max_reserved=}")
export_to_video(video, "output.mp4", fps=8)
The PR will hopefully be merged into accelerate main soon. If, for some reason, you cannot or do not want to use accelerate from the dev branch, you could do the following:
Code without accelerate dev branch requirement:
import gc
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
def flush():
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.reset_max_memory_allocated()
    torch.cuda.reset_peak_memory_stats()

def bytes_to_giga_bytes(bytes):
    return f"{(bytes / 1024 / 1024 / 1024):.3f}"
flush()
prompt = (
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
"The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
"atmosphere of this unique musical performance."
)
pipe = CogVideoXPipeline.from_pretrained("/raid/aryan/CogVideoX-trial", torch_dtype=torch.float16).to("cuda")
latents = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50, output_type="latent", return_dict=False)[0]
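# Denoising is done; move the transformer and text encoder off the GPU so the
# VAE decode of the latents has the memory to itself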
pipe.transformer.to("cpu")
pipe.text_encoder.to("cpu")
torch.cuda.synchronize()
torch.cuda.empty_cache()
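# Decode the latents now that the GPU is mostly free; no_grad avoids keeping
# activations around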
with torch.no_grad():
    video = pipe.decode_latents(latents, num_seconds=6)
video = pipe.video_processor.postprocess_video(video=video, output_type="pil")
torch.cuda.empty_cache()
memory = bytes_to_giga_bytes(torch.cuda.memory_allocated())
max_memory = bytes_to_giga_bytes(torch.cuda.max_memory_allocated())
max_reserved = bytes_to_giga_bytes(torch.cuda.max_memory_reserved())
print(f"{memory=}")
print(f"{max_memory=}")
print(f"{max_reserved=}")
export_to_video(video, "output.mp4", fps=8)
Additionally, you can play around with the |
|
I am perfectly able to run the 2nd example above on an A4500 (20GB) with a fresh PyTorch 2.3 install and diffusers:main. We'll have to try and debug what's going wrong in your setup. You mention:
Can you paste the error stack trace here? I would like to know at what point it's failing. If it's failing somewhere in attention, it's probably because you're unable to use FA2, which is necessary to be able to run with low memory. Can you try setting up PyTorch so that it allows you to run FA2? |
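(As an aside, a minimal sketch for testing whether the flash-attention SDPA backend actually runs on a given setup, assuming PyTorch >= 2.0 with a CUDA or ROCm build:)
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 128, 64, device="cuda", dtype=torch.float16)
# Restrict SDPA to the flash-attention kernel only; if the build or GPU cannot
# use it, scaled_dot_product_attention raises instead of silently falling back.
with torch.backends.cuda.sdp_kernel(enable_flash=True, enable_math=False, enable_mem_efficient=False):
    F.scaled_dot_product_attention(q, q, q)
print("flash-attention SDPA is available")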
Installed flash_attn 2.0.4 built from source; not sure how to force torch to let me use it. Output:
EDIT: SDPA is disabled on AMD cards because they decided that only cool kids can play with fun toys |
Updating to torch 2.5.0 nightly makes it attempt to allocate 70.63GB (about 35GB per GPU now). Forcefully disabling SDPA via |
Yeah, it's quite unfortunate that AMD doesn't support SDPA :( From the error logs, it seems like that's the only bottleneck making it impossible to run Cog for you. Let me know if you find any issues on the Diffusers side of things that I can help with. |
One or more GPUs must have >= 20GB because of the VAE. |
Thank you. Is there any way to split the VAE module? |
Not yet; we will try tiled VAE. We tested balanced placement across 3 GPUs with 20GB each; you can try whether it runs on 3 x 16GB GPUs. |
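(Before sharding the VAE across GPUs, it may be worth checking whether the decode fits with the VAE's own memory switches. A hedged sketch, assuming pipe is the CogVideoXPipeline from the snippets above and that the installed diffusers version exposes these methods:)
# Decode in slices/tiles rather than all at once; guard the calls because
# availability depends on the diffusers version.
if hasattr(pipe.vae, "enable_slicing"):
    pipe.vae.enable_slicing()
if hasattr(pipe.vae, "enable_tiling"):
    pipe.vae.enable_tiling()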
|
Try installing the diffusers and accelerate libs from source and check the CLI demo in |
Hi, why do I still need ~36GB of GPU memory even though I set pipe.enable_model_cpu_offload()? |
Did you follow the cli_demo.py code, and are you using an NVIDIA Ampere or newer GPU like a 3090 or 4090? |
I ran it on an NVIDIA A100, and the demo code is from Hugging Face: https://huggingface.co/THUDM/CogVideoX-2b |
You can try reinstalling the diffusers and accelerate libs from source; an A100 should work using inference/cli_demo.py in this GitHub repo. |
We have updated the repository; dependencies can now be installed from pip. Updating the dependencies and retrying cli_demo should solve the problem. If there are any new issues, please open a new issue.
System Info / 系統信息
Thanks very much for releasing this great work!
In case this helps anyone else:
The diffusers cli_demo.py raised the CUDA OOM error below on an RTX 3090 with 24GB of VRAM using this command: ... but it works, with barely enough free VRAM, with this small adjustment: set the env var PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True prior to running.
CUDA OOM error:
nvidia-smi before running -- only 135MiB VRAM used:
nvidia-smi while running with PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True -- 23GB used but no CUDA OOM.
This was using a fresh conda environment with dependencies installed by pip from requirements.txt (side note: please include imageio in requirements.txt and also open the version range for opencv-python to >=4.10, as 4.10 was yanked from upstream).
Information / 问题信息
Reproduction / 复现过程
Details above.
Expected behavior / 期待表现
Details above.