
Add Birch-san's sub-quadratic attention implementation #6055

Merged (8 commits) on Jan 7, 2023

Conversation

brkirch
Collaborator

@brkirch brkirch commented Dec 27, 2022

This is a highly memory-efficient cross attention optimization that enables high-resolution image generation with much less VRAM.

For more details see:
https://twitter.com/Birchlabs/status/1607503573906063362
Birch-san/diffusers#1

Edit: MIT license added. As far as licensing goes, this PR should be ready to merge.

More detail regarding the changes in this PR:

  • Add credit for sub-quadratic attention to README.md
  • Add a --opt-sub-quad-attention option for enabling the sub-quadratic attention optimization. Also adds --sub-quad-q-chunk-size, --sub-quad-kv-chunk-size, and --sub-quad-chunk-threshold options for fine-tuning memory usage or performance.
  • psutil has been changed to install for all platforms in requirements.txt. This allows sub-quadratic attention to set its chunk size based on available memory. All the additional checks for the InvokeAI cross attention optimization regarding psutil have been removed.
  • Checking available VRAM has been moved to a function, and Doggettx's cross attention optimization has been modified to use it. A side effect of this is that Doggettx's cross attention optimization can now work without CUDA. InvokeAI's cross attention is still superior in most cases when not using CUDA, however, so it remains the default.
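
For context, here is a minimal sketch of the chunked ("sub-quadratic") attention idea, written against plain PyTorch tensors of shape (batch, tokens, dim). It is an illustration of the technique only, not the code added by this PR: queries and keys/values are processed in blocks with an online softmax, so the full tokens-by-tokens score matrix is never materialized, and the block sizes play the role of the q/kv chunk size options above.

```python
import torch

def chunked_attention(q, k, v, q_chunk=1024, kv_chunk=4096):
    """Attention without materializing the full (tokens x tokens) score
    matrix; only one (q_chunk x kv_chunk) block exists at a time."""
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for i in range(0, q.shape[1], q_chunk):
        qc = q[:, i:i + q_chunk] * scale
        # Online-softmax accumulators for this query chunk.
        acc = torch.zeros_like(qc)  # running sum of softmax weights @ values
        row_sum = torch.zeros(*qc.shape[:2], 1, dtype=q.dtype, device=q.device)
        row_max = torch.full((*qc.shape[:2], 1), float("-inf"),
                             dtype=q.dtype, device=q.device)
        for j in range(0, k.shape[1], kv_chunk):
            kc, vc = k[:, j:j + kv_chunk], v[:, j:j + kv_chunk]
            scores = qc @ kc.transpose(-1, -2)      # (batch, q_chunk, kv_chunk)
            block_max = scores.amax(dim=-1, keepdim=True)
            new_max = torch.maximum(row_max, block_max)
            correction = torch.exp(row_max - new_max)  # rescale earlier blocks
            p = torch.exp(scores - new_max)
            acc = acc * correction + p @ vc
            row_sum = row_sum * correction + p.sum(dim=-1, keepdim=True)
            row_max = new_max
        out[:, i:i + q_chunk] = acc / row_sum
    return out

# e.g. 4,096 tokens (the largest self-attention layer for a 512x512 image)
q = k = v = torch.randn(1, 4096, 64)
print(chunked_attention(q, k, v).shape)  # torch.Size([1, 4096, 64])
```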

@ClashSAN
Collaborator

ClashSAN commented Dec 27, 2022

With my 3050 (4gb) I max out at 960x1024 using just --opt-sub-quad-attention flag

I currently get max 1408x1472 (1472x1472 producing black images) with just --lowvram flag

I currently get max 1472x1536 with just --lowvram --xformers flag

Are you having better speeds with this than --lowvram or --medvram using mac?

I am using DPM++ 2M SDE Karras, 1.X model, xformers compiled by C43H66N12O12S2

@USBhost

USBhost commented Dec 27, 2022

So I tested this on my 3080 Ti and it seems to work, but oh man, it's slow. And the reason it's slow is that it's only using 20% of my GPU.

The images look the same at a normal glance with or without this enabled. So that's good, and if it does in fact not change the image (unlike --lowvram), then you'll have a win on that front. But it is still slow, at least on my system.

@brkirch
Collaborator Author

brkirch commented Dec 27, 2022

I've added --sub-quad-attn-q-chunk-size and --sub-quad-attn-kv-chunk-size options to experiment with.

With my 3050 (4gb) I max out at 960x1024 using just --opt-sub-quad-attention

I currently get 1408x1472 (1472x1472 producing black images) with --lowvram

I currently get 1472x1536 with --lowvram --xformers

Are you having better speeds with this than --lowvram or --medvram using mac?

I am using DPM++ 2M SDE Karras, 1.X model, xformers compiled by C43H66N12O12S2

This is all good info, thank you. I still haven't gotten to test much, but it does look like I see a big performance hit with --lowvram: compared to running without it, speeds seem to be 4x slower or worse. I do, however, see a big drop in RAM usage. Currently there is an MPS-related bug preventing me from successfully generating images with 2^21 or more pixels, so I can't yet increase the image size enough to effectively test this.

@ClashSAN
Collaborator

ClashSAN commented Dec 27, 2022

ok, just being clear, I did not pair those flags with --opt-sub-quad-attention, and normally the existing cross attention optimization is enabled by default.

I don't know if you mean you're testing --lowvram + the --opt-sub-quad-attention implementation together.

Maybe there is some clear benefit for cpu and mac users at a specific size.
I will test again later.

@brkirch
Collaborator Author

brkirch commented Dec 28, 2022

I've just made some changes; try --opt-sub-quad-attention together with --sub-quad-min-chunk-vram-percent at some different numbers (e.g. --sub-quad-min-chunk-vram-percent 75 to use kv chunks that are as large as 75% of available VRAM). Note that available VRAM is checked at the start of each image generation. Using a large percentage of available VRAM should result in much better speeds.

ok, just being clear, I did not pair those flags with --opt-sub-quad-attention, and normally the existing cross attention optimization is enabled by default.

I don't know if you mean you're testing --lowvram + the --opt-sub-quad-attention implementation together.

Yes, I did assume you were using --opt-sub-quad-attention. I recommend against using --lowvram with --opt-sub-quad-attention if possible, because --lowvram will usually be unneeded when --opt-sub-quad-attention is used.
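
To illustrate what a flag like --sub-quad-min-chunk-vram-percent is aiming at (a hypothetical sketch only, with made-up names and heuristics, not this PR's code): the kv chunk size can be derived from the memory that is actually free when generation starts, using CUDA's free-memory query on NVIDIA cards and psutil elsewhere, as the PR description mentions.

```python
import psutil
import torch

def pick_kv_chunk_size(n_tokens, head_dim, n_heads,
                       mem_percent=75.0, q_chunk=1024, dtype_bytes=2):
    """Hypothetical helper: choose a kv chunk size so one (q_chunk x kv_chunk)
    score block, plus its key/value rows, fits in a percentage of the memory
    that is free right now."""
    if torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()
    else:
        # CPU / MPS: fall back to free system RAM (what psutil reports).
        free_bytes = psutil.virtual_memory().available
    budget = free_bytes * mem_percent / 100.0
    # Rough cost per kv token: one score for every query token in the q chunk,
    # plus the key and value vectors themselves, across all heads.
    per_kv_token = (q_chunk + 2 * head_dim) * n_heads * dtype_bytes
    return max(1, min(int(budget // per_kv_token), n_tokens))

# e.g. the largest self-attention layer of SD 1.x at 512x512:
# pick_kv_chunk_size(4096, head_dim=40, n_heads=8, mem_percent=75)
```

The exact option names and heuristics were still changing during this PR (and were later removed entirely), so treat this purely as an illustration of "size the chunk from free memory".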

@ClashSAN
Collaborator

ClashSAN commented Dec 28, 2022

Testing the new flag: using just --sub-quad-min-chunk-vram-percent at 75, 65, 55, and 30 does not increase the max image size for me. The maximum is the same as having no flags at all, at every percentage I tried (e.g. --sub-quad-min-chunk-vram-percent 30).

Is it working well with CPU, though?

@brkirch
Collaborator Author

brkirch commented Dec 28, 2022

Make sure to have both --opt-sub-quad-attention and --sub-quad-min-chunk-vram-percent, otherwise sub-quad attention isn't even enabled (check your command line output for "Applying sub-quadratic cross attention optimization."). I haven't tested CPU yet, but currently I'm working on a fix for MPS crashing at high resolutions so that I can actually test those.

@ClashSAN
Collaborator

ClashSAN commented Dec 28, 2022

It does work: with --opt-sub-quad-attention --sub-quad-min-chunk-vram-percent 75 the maximum size increased from 512x576 (default, no flags) to 768x704. I tried different percentages (5, 30, 75), but none changed inference speed or maximum size.

From testing --medvram and --lowvram individually, it was faster than --lowvram and slower than --medvram.

I haven't tested the --sub-quad-min-chunk-size flag; should I pair it with another?

@brkirch
Collaborator Author

brkirch commented Dec 28, 2022

Well, I finally figured out the issues with generating at high resolutions with MPS (I'll add the fix in a PR later), so I can now generate at any resolution below 2048x2048 (maybe Stable Diffusion 2.0/2.1 models would work for 2048x2048 or larger; I'll test later). Here's an image generated using highres fix at 2000x2000:
[image: 00004-307411557-masterpiece,best quality,rainbow,sky,rain,umbrella,CG,wallpaper,HDR,high quality,high-definition,extremely detailed,1girl]
At resolutions that high, it looks like it might actually be better to first use the highres fix to increase the resolution about halfway and then use img2img "Just resize (latent upscale)" to increase the resolution further.

@ClashSAN
Collaborator

ClashSAN commented Dec 28, 2022

Wow! Great news for Mac users. At that size, did you manage to test --lowvram and --medvram speeds as well (since you fixed the larger sizes issue)?

@cgessai

cgessai commented Dec 29, 2022

Would this have any benefit for a mobile 4GB GTX 1650 user like me? My ubiquitous card needs --no-half, but it would crash on load with the two sub-quad arguments at min-chunk percentages ranging from 75 to 30. Adding --medvram I could do 320x320, but that was the biggest. On current SD I can usually get 576x576 max and fast (~2 sec/it) performance with the command line options --no-half --no-half-vae --medvram --opt-split-attention --xformers. I was hoping this fix might help boost the resolution I can generate in those much quicker modes (compared to args with --lowvram). My card only has CUDA cores, though. Is this fix only for cards with tensor cores?

@brkirch
Collaborator Author

brkirch commented Dec 30, 2022

@ClashSAN @cgessai I've made what is probably a very important modification for low memory usage, so try using sub-quadratic attention again with the latest changes.

The latest change is also very relevant to Mac users: now if sub-quadratic attention is enabled then large image generation (1472x1472 or larger) should work! @rworne

Would this have any benefit for a mobile 4GB GTX 1650 user like me?

Probably yes, especially with the most recent change. Birch-san noted that this 2048x2048 image was generated with 0.08GB chunks instead of requiring 80GB VRAM for the self-attention matmul. You'll still probably want to experiment with the kv chunk size (--sub-quad-min-chunk-vram-percent or --sub-quad-min-chunk-size) some for optimal speed. I also can't guarantee image generation that large, since there is other overhead. But you'll hopefully be able to generate much larger images than before.
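
For a rough sense of where figures like these come from (my own back-of-the-envelope arithmetic, not Birch-san's numbers; assuming an SD 1.x model, where a 2048x2048 image gives a 256x256 latent and therefore 65,536 tokens in the largest self-attention layer, with 8 heads in fp16):

```python
tokens = (2048 // 8) ** 2               # 65,536 tokens in the largest attention layer
heads, bytes_per_element = 8, 2         # fp16
naive = tokens * tokens * heads * bytes_per_element
print(naive / 2**30)                    # ~64 GiB of attention scores, same order as the ~80 GB quoted
chunked = 1024 * 4096 * heads * bytes_per_element  # one q-chunk x kv-chunk block at a time instead
print(chunked / 2**20)                  # ~64 MiB, i.e. the same order as the "0.08GB chunks"
```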

@rworne

rworne commented Dec 30, 2022

Correction:

I need to not be so aggressive in my testing and go with a smaller initial image size.

Still just messing with it, but in a direct comparison with an image size of 1024x1024, highres fix on, and the percentage set to 25%, I'm seeing a marked reduction in RAM usage: about a 30% reduction, or a 30-32GB Python process down to a 20-23GB Python process. This is fantastic progress. Even if the changes lose some efficiency in speed, preventing the system from swapping will overcome any slowdown you'd see otherwise. Additionally, my 32GB macOS install will get unstable if the swap grows too large - say 13-16GB while running SD.

Running again with percentage set to 5.
Prompt:
a girl holding an umbrella in the rain
Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 1134799079, Size: 1024x1024, Model hash: 6569e224, Denoising strength: 0.4, First pass size: 0x0
[image: 00002-1134799079-a girl holding an umbrella in the rain]

This took 6 min 30 sec to complete on a 32GB base model Mac Studio. In comparison, prior to this patch I was looking at 30 minutes. RAM usage was mostly between 16 and 19GB with occasional spikes at 21GB. Minimal if no swapping occurred. Memory pressure was green throughout the image generation.

Running on the current branch of Automatic1111, this same prompt took 28 min 30 sec. Swap file was around 4GB, and memory pressure was yellow more than a third of the time.

Old Comment:
This is great news. I tried it out and am still getting large amounts of RAM use & swap, even though I am getting the cross attention optimization message in my console. I'll have to mess with this more later.

I used the two below CLI settings:
--no-half --use-cpu interrogate --opt-sub-quad-attention --sub-quad-min-chunk-vram-percent 25 --skip-torch-cuda-test
--no-half --use-cpu interrogate --opt-sub-quad-attention --sub-quad-min-chunk-vram-percent 75 --skip-torch-cuda-test

Saw no appreciable difference in either.

@Ehplodor

@cgessai Just for testing, I tried, on my little laptop RTX 3050 (4GB VRAM), your combination of parameters "--no-half --no-half-vae --medvram --opt-split-attention --xformers" vs. my usual, which is only "--xformers" (1.44 s/it with Euler a, txt2img, 512x512). Generation time actually increased to >5 s/it (same generation settings), so I reverted to --xformers only. Have you tried that?

@ClashSAN
Collaborator

ClashSAN commented Dec 30, 2022

hi, I think the percent levels are still not working for cuda.
but it seems --opt-sub-quad-attention is in effect, as my max is 960x1024

using --opt-sub-quad-attention --sub-quad-min-chunk-vram-percent(1,5,25,75)

@rworne

rworne commented Dec 30, 2022

Edit:

Some performance stats:

Prompt:
a girl holding an umbrella in the rain
Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 1134799079, Size: 1024x1024, Model hash: 6569e224, Denoising strength: 0.4, First pass size: 0x0

Run on 32GB Mac Studio, base model:

sub-quad-min-chunk-size 30 7:56
sub-quad-min-chunk-size 27 8:07
sub-quad-min-chunk-size 25 7:29
sub-quad-min-chunk-size 23 7:52
sub-quad-min-chunk-size 21 7:51
sub-quad-min-chunk-size 17 7:31
sub-quad-min-chunk-size 15 7:57
sub-quad-min-chunk-size 10 9:09

These numbers are really good. I'll probably need to run a size larger than 1024x1024, since with the new optimizations it now seems to comfortably fit within my RAM.

This is a 500-600% speed increase over stock Automatic1111 (for me). The increase was due to the smaller memory footprint keeping the system from swapping. So RAM-bound tasks on a 64GB machine likely won't see any improvement?

sub-quad-min-chunk-vram-percent 1 6:53
sub-quad-min-chunk-vram-percent 5 5:42
sub-quad-min-chunk-vram-percent 10 5:37
sub-quad-min-chunk-vram-percent 15 5:33
sub-quad-min-chunk-vram-percent 25 5:28
sub-quad-min-chunk-vram-percent 50 5:23
sub-quad-min-chunk-vram-percent 75 5:17
sub-quad-min-chunk-vram-percent 85 5:22
sub-quad-min-chunk-vram-percent 95 5:17

Some more quick numbers for 1280x1280:
sub-quad-min-chunk-vram-percent 95 13:55
sub-quad-min-chunk-vram-percent 50 13:00
sub-quad-min-chunk-vram-percent 25 12:51
sub-quad-min-chunk-vram-percent 10 14:51

At these resolutions, I'm looking at 60+ minutes before this patch.

Original:

I tried it with --sub-quad-min-chunk-size 15, and it also renders 1472x1472 in approx. 30 minutes. Without this patch, I can do 1024x1024 in the same timeframe.

What I am seeing is that the memory consumed may not be getting returned. Typically when generating an image it grows, but when finished, most of the consumed memory is returned. In the above case, it wasn't. I usually do not see a 100% return, but at least 50% or more. What you see below is what I saw after coming back from shopping; SD had finished about 90 minutes earlier.

[screenshot: Screenshot 2022-12-30 at 1.41.22 PM]

and after killing the process:

[screenshot: Screenshot 2022-12-30 at 1.42.43 PM]

@brkirch
Collaborator Author

brkirch commented Dec 31, 2022

I've removed the configuration options; hopefully they will be unneeded. Now just use --opt-sub-quad-attention.

hi, I think the percent levels are still not working for cuda. but it seems --opt-sub-quad-attention is in effect, as my max is 960x1024

using --opt-sub-quad-attention --sub-quad-min-chunk-vram-percent(1,5,25,75)

Yes, you were correct, the settings were broken. I've actually removed them to try having the chunk size adjusted in real time based on available VRAM. This seems to work quite well for macOS, but it will be interesting to see how well it works for CUDA.

@cgessai

cgessai commented Dec 31, 2022

@Ehplodor
txt2img 512x512, 20 steps, Euler a:
--xformers CMD ARG: 50 seconds, 2.54 s/it
My 'fast' CMD ARGs: 30 seconds, 1.53 s/it

Are you sure you didn't get mixed up between seconds/it and it/s? (The readout switches their order back and forth depending on which is moving more quickly.) I ask because I should not be able to get within ~1/10th of a second of your card. It's anecdotal, but every once in a while I've noticed 'better' 4GB cards than mine reporting seemingly worse performance.

@brkirch
First, may sound like a strange question but can you provide me with the syntax for a single GIT command to download your sub-quad_attn_opt branch all in one go and update it when/if you make changes? I've grabbed the new files for these tests but I'm not familiar with GIT and it took some time to make sure I had all the correct files.

Second, I'll leave the detailed notes here, but I looked at whether I could obtain any performance boost specifically, as opposed to enhanced resolution, because of an un-noted earlier test where a 1280x1280 crapped out and (presumably) dumped the image to pixels. In summary:

  • At 832x832, my 'slow.bat' --lowvram CMD ARGS (see link above, last stanza) was still substantially faster than the 75, 20, 10, and 5 vram-percent command lines.
  • When I added my 'fast.bat' CMD ARGS to the end of the two sub-quad attention args at various vram percentages, I was able to preserve the faster speeds I'm familiar with but, oddly, I was never able to go above 576x576 - a limitation I have under regular SD. I had assumed the new method would allow me to sneak higher. In order for my 'fast.bat' to work, I have to have at least --no-half and --medvram, because --no-half alone gives me an OOM error - as it does for regular SD. That combo may be existentially incompatible?

So, for me, I'm not seeing any performance edge yet. I realize that's not exactly what these changes are mainly about, but I hoped it would be a fringe benefit.

@brkirch
Collaborator Author

brkirch commented Jan 1, 2023

@rworne I made a change that should get much better performance, but I'll need some testing done to see if it works well on Macs with less memory than mine has (I have 64 GB). --sub-quad-min-chunk-vram-percent and --sub-quad-min-chunk-size have been removed for now; you just need --opt-sub-quad-attention.

@brkirch First, may sound like a strange question but can you provide me with the syntax for a single GIT command to download your sub-quad_attn_opt branch all in one go and update it when/if you make changes? I've grabbed the new files for these tests but I'm not familiar with GIT and it took some time to make sure I had all the correct files.

@cgessai If you want to keep it simple, you can clone a second copy of the web UI with the changes:
git clone -b sub-quad_attn_opt https://github.com/brkirch/stable-diffusion-webui
Make sure to clone into a different directory from your existing web UI install. Then you can update with the changes anytime with git pull.

Second, I'll leave the detailed notes here, but I looked at whether I could obtain any performance boost specifically, as opposed to enhanced resolution, because of an un-noted earlier test where a 1280x1280 crapped out and (presumably) dumped the image to pixels. In summary:

  • At 832x832, my 'slow.bat' --lowvram CMD ARGS (see link above, last stanza) was still substantially faster than the 75, 20, 10, and 5 vram-percent command lines.
  • When I added my 'fast.bat' CMD ARGS to the end of the two sub-quad attention args at various vram percentages, I was able to preserve the faster speeds I'm familiar with but, oddly, I was never able to go above 576x576 - a limitation I have under regular SD. I had assumed the new method would allow me to sneak higher. In order for my 'fast.bat' to work, I have to have at least --no-half and --medvram, because --no-half alone gives me an OOM error - as it does for regular SD. That combo may be existentially incompatible?

So, for me, I'm not seeing any performance edge yet. I realize that's not exactly what these changes are mainly about, but I hoped it would be a fringe benefit.

Thanks for the benchmarks, but unfortunately I discovered quite a few things broken that actually prevented the chunk settings from working correctly. Also, I saw just recently that Birch-san has mentioned that sub-quadratic attention is comparable to xformers - so if you have xformers working, it may still be the better option. That said, the latest changes should have much better performance than before if they work on CUDA. They may run out of memory instead - I'll need someone to tell me whether it still works or not. Note that the options --sub-quad-min-chunk-vram-percent and --sub-quad-min-chunk-size have been removed, at least for now.

@ClashSAN
Collaborator

ClashSAN commented Jan 1, 2023

@brkirch If you can't find a tester, I'll be here for you... just taking my sweet time is all. :)
I'll edit this post with the test of the new commit.

dabcda4
with --opt-sub-quad-attention max size: 704x768

4bfa22e
with --opt-sub-quad-attention max size: 576x640

cadcb36
with --opt-sub-quad-attention max size: 768x832

848605f
with --opt-sub-quad-attention max size: 640x704

@rworne

rworne commented Jan 1, 2023

Using commit 4bfa22e
Command line options: --no-half --use-cpu interrogate --opt-sub-quad-attention

1024x1024:

100%|███████████████████████████████████████████| 20/20 [00:19<00:00,  1.02it/s]
100%|███████████████████████████████████████████| 20/20 [03:47<00:00, 11.39s/it]
Total progress: 100%|███████████████████████████| 40/40 [04:10<00:00,  6.27s/it]

Swap grew to 6-7GB, but that was likely stuff other than SD being swapped out. I don't use the Mac when running the speed tests. This is the fastest I have ever gotten an image of this size to draw.

For 1152x1152, it slowed down considerably:

100%|███████████████████████████████████████████| 20/20 [00:33<00:00,  1.68s/it]
100%|███████████████████████████████████████████| 20/20 [18:30<00:00, 55.51s/it]
Total progress: 100%|███████████████████████████| 40/40 [19:22<00:00, 29.07s/it]

For 1280x1280, the performance is much worse. Back to pre-patch performance and a 20GB swap file:

100%|███████████████████████████████████████████| 20/20 [00:19<00:00,  1.03it/s]
100%|███████████████████████████████████████████| 20/20 [30:35<00:00, 91.75s/it]
Total progress: 100%|███████████████████████████| 40/40 [31:12<00:00, 46.81s/it]

So with this patch, I can do 1024x1024 as a max size. Previously, 512x768 was roughly the max of what I could do without swapping.

After running these three images, the Python process is 34GB at the moment, with a 16.5GB swap file. Usually it gives some memory back after processing images, but not anymore. Killing the process gives everything back, though.

@brkirch brkirch force-pushed the sub-quad_attn_opt branch from ffb424b to cadcb36 Compare January 2, 2023 09:33
@brkirch
Collaborator Author

brkirch commented Jan 2, 2023

@rworne @ClashSAN @cgessai I've made some more adjustments, performance should hopefully be similar but with lower memory usage.

@rworne

rworne commented Jan 2, 2023

Latest results, pretty much the same. From earlier results I know that 1280x1280 can be nearly halved from where it is now, but this is very good progress.

1024x1024

100%|███████████████████████████████████████████| 20/20 [00:18<00:00,  1.06it/s]
100%|███████████████████████████████████████████| 20/20 [04:25<00:00, 13.25s/it]
Total progress: 100%|███████████████████████████| 40/40 [04:49<00:00,  7.23s/it]

1152x1152

100%|███████████████████████████████████████████| 20/20 [00:19<00:00,  1.03it/s]
100%|███████████████████████████████████████████| 20/20 [15:59<00:00, 47.95s/it]
Total progress: 100%|███████████████████████████| 40/40 [16:36<00:00, 24.91s/it]

1280x1280

100%|███████████████████████████████████████████| 20/20 [00:21<00:00,  1.09s/it]
100%|██████████████████████████████████████████| 20/20 [34:45<00:00, 104.28s/it]
Total progress: 100%|███████████████████████████| 40/40 [35:38<00:00, 53.45s/it]

I'm pretty sure other processes running on the Mac are affecting these numbers, so I did an experiment to check:
I added --listen to the command line options, connected to Automatic1111 from my iPhone, and ran the same thing at 1280x1280 without that memory pig Chrome running (the 1GB of RAM I get back helps a lot). These are the results for 1280x1280:

100%|███████████████████████████████████████████| 20/20 [00:18<00:00,  1.11it/s]
100%|███████████████████████████████████████████| 20/20 [21:30<00:00, 64.54s/it]
Total progress: 100%|███████████████████████████| 40/40 [22:04<00:00, 33.10s/it]

@brkirch
Collaborator Author

brkirch commented Jan 4, 2023

I've got something more for Mac users to test. (@rworne)

This fork of PyTorch gets ~25% better performance for GPU acceleration on macOS. Memory usage is much lower too. To install it, make sure Xcode is up to date, update to the latest changes in this PR, run brew install pkg-config libuv, and then open webui-user.sh in Xcode and replace its contents with:

#!/bin/bash
#########################################################
# Uncomment and change the variables below to your need:#
#########################################################

# Install directory without trailing slash
#install_dir="/home/$(whoami)"

# Name of the subdirectory
#clone_dir="stable-diffusion-webui"

# Commandline arguments for webui.py, for example: export COMMANDLINE_ARGS="--medvram --opt-split-attention"
export COMMANDLINE_ARGS="$COMMANDLINE_ARGS --opt-sub-quad-attention"

# python3 executable
#python_cmd="python3"

# git executable
#export GIT="git"

# python3 venv without trailing slash (defaults to ${install_dir}/${clone_dir}/venv)
venv_dir="venv-torch-2.0-alpha"

# script to launch to start the app
#export LAUNCH_SCRIPT="launch.py"

# install command for torch
export USE_DISTRIBUTED=1
export TORCH_COMMAND="pip install git+https://github.com/kulinseth/pytorch@d5e2e4e14ba4e0ceeef84bab3a486b189050669e#egg=torch --pre torchvision==0.15.0.dev20230103 -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html"

# Requirements file to use for stable-diffusion-webui
#export REQS_FILE="requirements_versions.txt"

# Fixed git repos
#export K_DIFFUSION_PACKAGE=""
#export GFPGAN_PACKAGE=""

# Fixed git commits
#export STABLE_DIFFUSION_COMMIT_HASH=""
#export TAMING_TRANSFORMERS_COMMIT_HASH=""
#export CODEFORMER_COMMIT_HASH=""
#export BLIP_COMMIT_HASH=""

# Uncomment to enable accelerated launch
#export ACCELERATE="True"

###########################################

The first time you run ./webui.sh it will take a long time at "Installing torch and torchvision"; this is normal because it is building PyTorch from source instead of installing an already-built package.

From the testing I did, I noticed that training embeddings and hypernetworks was broken, but as far as I could tell all other features still seemed to function correctly. If any macOS users who try this could let me know whether it helps with performance and memory usage when using sub-quadratic attention, it would be much appreciated.

@brkirch brkirch force-pushed the sub-quad_attn_opt branch 2 times, most recently from b8d8b6d to 7020a68 Compare January 4, 2023 12:17
@rworne

rworne commented Jan 4, 2023

EDIT:
I missed a step in your instructions; I still need to update/install libuv. I'll update later when I get it done.
EDIT 2:
Got it done. I do not see any appreciable difference with it. The crash when "Upscale by" = 2 still occurs; the 2nd pass dies at 10%.

Ah, I figured out how the new hiresfix works. To do it the old way: set your image size to 512x512, then use "Upscale by" as a multiplication factor. The crash was occurring because 2048x2048 is too big for my setup.

So 512x512 and upscale by 2 gives a 1024x1024 image:
a girl holding an umbrella in the rain
Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 1134799079, Size: 512x512, Model hash: 6569e224, Denoising strength: 0.4, Hires upscale: 2, Hires upscaler: Latent
Time taken: 2m 32.37s

This is half the time of the optimized version! Incredible! But the upscaler isn't the same.
[image: 00657-1134799079-a girl holding an umbrella in the rain]

However, changing the upscaler to ESRGAN_4x gives similar results and takes only a couple more seconds to finish:
a girl holding an umbrella in the rain
Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 1134799079, Size: 512x512, Model hash: 6569e224, Denoising strength: 0.4, Hires upscale: 2, Hires upscaler: ESRGAN_4x
Time taken: 2m 36.87s
[image: 00658-1134799079-a girl holding an umbrella in the rain]

100%|███████████████████████████████████████████| 20/20 [00:14<00:00,  1.35it/s]
100%|███████████████████████████████████████████| 20/20 [02:15<00:00,  6.76s/it]
Total progress: 100%|███████████████████████████| 40/40 [02:36<00:00,  3.90s/it]

ORIGINAL:

I've got something more for Mac users to test. (@rworne)

Loaded it up. It's faster:

for 512x512:

100%|███████████████████████████████████████████| 20/20 [00:15<00:00,  1.31it/s]
Total progress: 100%|███████████████████████████| 20/20 [00:15<00:00,  1.33it/s]

A modest improvement.

But it has issues:

  1. It won't render anything if the image size would cause the old NDArray > 2**32 error, so 1024x1024 bombs out halfway.
  2. Hires fix does not rescale the original image? A 512x512 in the old version will scale up to a larger version of the same image; here it's a whole new image - and it's slower.

For example:
512x512 render is 1.3 it/s
1024x1024 with hires fix is 7.11 s/it
It however crashes on the 2nd pass.

With hiresfix checked and upscale by on the default of 2, it crashes 10% of the way through the drawing:

100%|███████████████████████████████████████████| 20/20 [02:11<00:00,  6.55s/it]
 10%|████▎                                      | 2/20 [05:17<46:59, 156.64s/it]-[_MTLCommandBuffer addCompletedHandler:]:855: failed assertion `Completed handler provided after commit call'
./webui.sh: line 169: 36108 Abort trap: 6           "${python_cmd}" "${LAUNCH_SCRIPT}" "$@"
Robert@Mac-Studio stable-diffusion-webui % /opt/homebrew/Cellar/[email protected]/3.10.9/Frameworks/Python.framework/Versions/3.10/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

With upscale set to 1, it doesn't crash:

Here is 1016x1016:
100%|███████████████████████████████████████████| 20/20 [02:10<00:00,  6.54s/it]
100%|███████████████████████████████████████████| 20/20 [02:10<00:00,  6.51s/it]
Total progress: 100%|███████████████████████████| 40/40 [04:17<00:00,  6.44s/it]

I think there's no significant change in the end. If the first pass were faster, then it'd shave off quite a bit of time. The memory usage was very smooth, with no heavy spikes - Python was 18GB on average, never went above 20.

This is weird - with hiresfix off, it seems to generate usable images in one pass. Here's two runs:

100%|███████████████████████████████████████████| 20/20 [02:11<00:00,  6.55s/it]
Total progress: 100%|███████████████████████████| 20/20 [02:07<00:00,  6.40s/it]
100%|███████████████████████████████████████████| 20/20 [02:10<00:00,  6.53s/it]
Total progress: 100%|███████████████████████████| 20/20 [02:07<00:00,  6.38s/it]

Prompt:
a girl holding an umbrella in the rain
Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 1134799079, Size: 1016x1016, Model hash: 6569e224
Time taken: 2m 12.99s

And here's the image:
[image: 00650-1134799079-a girl holding an umbrella in the rain]

and this one:
a girl holding an umbrella in the rain
Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 1134799079, Size: 1016x1016, Model hash: 6569e224, Denoising strength: 0.4, Hires upscale: 1, Hires upscaler: Latent
Time taken: 4m 25.12s
[image: 00655-1134799079-a girl holding an umbrella in the rain]

So with hiresfix off, it behaves exactly like the old version; with it on, the image generated is nothing like the original it is trying to enlarge (and neither is like the original 512x512 image).

@brkirch brkirch force-pushed the sub-quad_attn_opt branch from 3c78db7 to b119815 Compare January 6, 2023 05:15
@brkirch brkirch marked this pull request as ready for review January 6, 2023 06:56
@brkirch brkirch requested a review from AUTOMATIC1111 as a code owner January 6, 2023 06:56
@AUTOMATIC1111
Owner

It looks like this code https://github.com/AminRezaei0x443/memory-efficient-attention/tree/1bc0d9e6ac5f82ea43a375135c4e1d3896ee1694 is a pip library; is it possible to just use that without copying code? If not, can we copy code from that unlicensed repo?

@brkirch brkirch marked this pull request as draft January 6, 2023 10:52
@brkirch
Collaborator Author

brkirch commented Jan 6, 2023

We do need the modifications, both for performance (Birch-san noted a 2.78x speedup) and Mac support. We are probably a lot better off for now just getting permission to use this implementation. If the package is later updated with similar performance and Mac support then we can use that instead.

@brkirch brkirch force-pushed the sub-quad_attn_opt branch from 354d626 to 848605f Compare January 6, 2023 14:41
@AUTOMATIC1111
Owner

AUTOMATIC1111 commented Jan 6, 2023

I got excited about 2.78x speedup, but I guess you mean for osx. I did make some pictures with this; my test was four batches of 768x768 pics on an RTX 3090.

xformers does it in 20.5s, Doggettx's in 24.7s, and this PR's in 31.33s. I get the same numbers if I try again. So no speedup on Nvidia. I ran it with just --opt-sub-quad-attention.

@rworne

rworne commented Jan 6, 2023

I forgot to add the new command line option when fixing the git issue.

Using same prompt & settings, upscaler = latent and upscale by = 1.

a girl holding an umbrella in the rain
Steps: 20, Sampler: Euler a, CFG scale: 7, Seed: 1134799079, Size: 512x512, Model hash: 6569e224, Denoising strength: 0.4, Hires resize: 1024x1024, Hires upscaler: Latent
Time taken: 3m 47.67s

Python 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)]
Commit hash: 848605fb654a55ee6947335d7df6e13366606fad
Installing requirements for Web UI
Launching Web UI with arguments: --no-half --use-cpu interrogate --opt-sub-quad-attention
Warning: caught exception 'Torch not compiled with CUDA enabled', memory monitor disabled
No module 'xformers'. Proceeding without it.
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Loading weights [6569e224] from /Volumes/SSD/Developer/test/stable-diffusion-webui/models/Stable-diffusion/Anything 3.0/Anything-V3.0.ckpt
Loading VAE weights from: /Volumes/SSD/Developer/test/stable-diffusion-webui/models/VAE/Anything-V3.0.vae.pt
Applying sub-quadratic cross attention optimization.
Textual inversion embeddings loaded(0): 
Model loaded.
Warning: Bad ui setting value: img2img/Mask mode/value: Draw mask; Default value "Inpaint masked" will be used instead.
Running on local URL:  https://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
100%|███████████████████████████████████████████| 20/20 [00:18<00:00,  1.06it/s]
100%|███████████████████████████████████████████| 20/20 [03:25<00:00, 10.30s/it]
Total progress: 100%|███████████████████████████| 40/40 [03:47<00:00,  5.69s/it]

With Upscale by = 2:

100%|███████████████████████████████████████████| 20/20 [00:19<00:00,  1.04it/s]
100%|███████████████████████████████████████████| 20/20 [03:42<00:00, 11.12s/it]
Total progress: 100%|███████████████████████████| 40/40 [04:05<00:00,  6.13s/it]

I can't say it's faster. I also see there may be command line options to tune it. I'll be working with it over the weekend to see what I can do with it.

@brkirch brkirch marked this pull request as ready for review January 6, 2023 21:43
@brkirch
Collaborator Author

brkirch commented Jan 6, 2023

The MIT license now applies to the Memory Efficient Attention package, and Birch-san has given permission for releasing their modified implementation under a permissive license as well. I've added the license to this PR. As far as licensing goes, this PR should be ready to merge.

I got excited about 2.78x speedup, but I guess you mean for osx.

I meant compared to the unmodified Memory Efficient Attention package. In my testing, sub-quadratic attention is slightly slower than the InvokeAI cross attention optimization, but with much better memory management. Birch-san mentioned sub-quadratic attention is essentially the same as xformers, but xformers is optimized for CUDA so it isn't too surprising it is faster.

Sub-quadratic attention seems to be best for Mac users and users without CUDA (or with CUDA that can't get xformers working). On Macs it wasn't possible before to generate images of 1472x1472 resolution or higher (regardless of available memory, the MPS backend would crash) but with sub-quadratic attention all resolutions under 2048x2048 work correctly.

@fractal-fumbler

I'd also add that sub-quadratic attention makes it possible for me, on an AMD GPU (gfx1031 - not supported by ROCm at the moment), to use --no-half and --precision full :)
Before, I got an OOM error.

@AUTOMATIC1111 AUTOMATIC1111 merged commit c295e4a into AUTOMATIC1111:master Jan 7, 2023
@AUTOMATIC1111
Owner

Will someone write about this optimization for the wiki page?

https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Optimizations

@jn-jairo
Collaborator

jn-jairo commented Jan 7, 2023

I tried it, but --lowvram --opt-split-attention-v1 is still the best in my case (2GB GPU); the max size is 576x576 with both, but --opt-split-attention-v1 is faster.

I also tried the other sub-quad parameters, but I couldn't get higher than 576x576.

@brkirch
Collaborator Author

brkirch commented Jan 7, 2023

Try --opt-sub-quad-attention --sub-quad-chunk-threshold 0

--sub-quad-chunk-threshold 0 disables unchunked attention and will use very small chunks to keep VRAM usage to a minimum. You can also adjust --sub-quad-q-chunk-size (default is 1024) and try values like 512 or 256 if that didn't help. --lowvram and --medvram are still supported as well, so you can still add either of those options.
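
To make concrete what a threshold of 0 means here (a hypothetical sketch of the idea only, not this PR's actual code): with a positive threshold, attention calls whose full score matrix would fit under that budget can skip chunking entirely, while a threshold of 0 sends every call down the chunked path.

```python
import torch

def attention_maybe_chunked(q, k, v, chunk_threshold_bytes=0, **chunk_kwargs):
    # Estimated size of the full (q_tokens x kv_tokens) score matrix.
    score_bytes = q.shape[0] * q.shape[1] * k.shape[1] * q.element_size()
    if chunk_threshold_bytes > 0 and score_bytes <= chunk_threshold_bytes:
        # Small enough to fit the budget: plain unchunked attention is fastest.
        scores = (q @ k.transpose(-1, -2)) * q.shape[-1] ** -0.5
        return torch.softmax(scores, dim=-1) @ v
    # A threshold of 0 (or an over-budget matrix) forces the chunked path,
    # e.g. the chunked_attention() sketch earlier in this thread, trading
    # some speed for minimal VRAM use.
    return chunked_attention(q, k, v, **chunk_kwargs)
```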

@jn-jairo
Collaborator

jn-jairo commented Jan 8, 2023

Try --opt-sub-quad-attention --sub-quad-chunk-threshold 0

--sub-quad-chunk-threshold 0 disables unchunked attention and will use very small chunks to keep VRAM usage to a minimum. You can also adjust --sub-quad-q-chunk-size (default is 1024) and try values like 512 or 256 if that didn't help. --lowvram and --medvram are still supported as well, so you can still add either of those options.

Same thing; it doesn't get a higher size than --opt-split-attention-v1, and it is slower. Maybe it is because of --lowvram, but I get an out-of-memory error right at the start without it, so that's that. I guess it helps with other configurations, but it is not the best option for extremely low VRAM like in my case.

@tusharbhutt

Same thing; it doesn't get a higher size than --opt-split-attention-v1, and it is slower. Maybe it is because of --lowvram, but I get an out-of-memory error right at the start without it, so that's that. I guess it helps with other configurations, but it is not the best option for extremely low VRAM like in my case.

I'm in the same boat on a 12GB 3060; there is little improvement compared to the standard optimization. Under cross-optimization I can get 2688x1792; with this, it goes up to 2720x1816, but only if I reduce the steps to five.

I've noticed that some recent commit has really upped VRAM usage, as I have run out of VRAM a couple of times, which never happened before (and that was before this commit), so maybe there's a VRAM leak somewhere.

@bach777

bach777 commented Jan 10, 2023

  • --opt-sub-quad-attention

Can --xformers and --opt-sub-quad-attention run at the same time?
I'm just getting "Applying xformers cross attention optimization." activated.

@brkirch
Collaborator Author

brkirch commented Jan 10, 2023

No, but if you have xformers working then you probably don't want to use sub-quadratic attention. Performance will be worse and it likely won't make a huge difference in the size of images you can generate.

@mclsugi

mclsugi commented Jan 27, 2023

Thank you for the --upcast-sampling, @brkirch!

It's on par with using --precision full --no-half in terms of performance for my 1070 Ti, but with the VRAM usage of FP16!
Pascal-based cards don't need --precision full, but their FP16 performance is worse since they can only run it via emulation (NVIDIA avoiding another sales hit in their compute card lineup after the Titan cards' success).

I don't have time to play around much at the moment to test further, but so far the experience is great. Some tasks that were usually guaranteed to OOM with precision full, like Aesthetic Gradients CLIP, ran just fine with --upcast-sampling. So, thank you once again!

@cgessai

cgessai commented Feb 1, 2023

@mclsugi If you have a sec, what does your full command line arg look like? Also, what's your seconds per iteration (sec/it) for your choice of sampler at 512x512 for 10-20 steps? (please name the sampler and number of steps, obviously) Thanks!

EDIT: --upcast-sampling is only for Mac?

@FNSpd
Contributor

FNSpd commented Mar 3, 2023

Thank you for the --upcast-sampling, @brkirch!

It's on par with using --precision full --no-half in terms of performance for my 1070 Ti, but with the VRAM usage of FP16! Pascal-based cards don't need --precision full, but their FP16 performance is worse since they can only run it via emulation (NVIDIA avoiding another sales hit in their compute card lineup after the Titan cards' success).

I don't have time to play around much at the moment to test further, but so far the experience is great. Some tasks that were usually guaranteed to OOM with precision full, like Aesthetic Gradients CLIP, ran just fine with --upcast-sampling. So, thank you once again!

How did you manage to get upcast working with a CUDA card? My GTX 1650 doesn't seem to work with it.

@mclsugi

mclsugi commented Mar 28, 2023

@mclsugi If you have a sec, what does your full command line arg look like? Also, what's your seconds per iteration (sec/it) for your choice of sampler at 512x512 for 10-20 steps? (please name the sampler and number of steps, obviously) Thanks!

EDIT: --upcast-sampling is only for Mac?

Sorry for the super late response; I hadn't touched SD from exactly that day until now (fresh install) because of real-life situations. I was using either Manjaro (Linux) or Windows 10. With this upcast, or plain xformers + precision full + no-half, I got around 2.7-2.8 it/s for a single batch, and much better throughput for a batch of 8 (7 sec for 1 image, 32-34 sec for 8 images). I did a lot of modifications and playing around which, sadly, I didn't document.
Nothing special in terms of config, just --xformers or --upcast-sampling with the relevant config. Just use some common sense when applying command line arguments; oh, and I overclocked the VRAM a bit, to 400. The rest of my settings are, I guess, irrelevant (theme, etc.).

How did you manage to get upcast working with a CUDA card? My GTX 1650 doesn't seem to work with it.

The 16** series has its own set of problems for this kind of job, if I'm not mistaken.

@FNSpd
Contributor

FNSpd commented Mar 28, 2023

@mclsugi If you have a sec, what does your full command line arg look like? Also, what's your seconds per iteration (sec/it) for your choice of sampler at 512x512 for 10-20 steps? (please name the sampler and number of steps, obviously) Thanks!
EDIT: --upcast-sampling is only for Mac?

Sorry for the super late response; I hadn't touched SD from exactly that day until now (fresh install) because of real-life situations. I was using either Manjaro (Linux) or Windows 10. With this upcast, or plain xformers + precision full + no-half, I got around 2.7-2.8 it/s for a single batch, and much better throughput for a batch of 8 (7 sec for 1 image, 32-34 sec for 8 images). I did a lot of modifications and playing around which, sadly, I didn't document. Nothing special in terms of config, just --xformers or --upcast-sampling with the relevant config. Just use some common sense when applying command line arguments; oh, and I overclocked the VRAM a bit, to 400. The rest of my settings are, I guess, irrelevant (theme, etc.).

How did you manage to get upcast working with a CUDA card? My GTX 1650 doesn't seem to work with it.

The 16** series has its own set of problems for this kind of job, if I'm not mistaken.

I already figured it out. I had to make some changes in the code, but it works now. Thanks for the response, though.
