Add Birch-san's sub-quadratic attention implementation #6055
Conversation
(branch force-pushed from 7f4eb85 to 3505fe7)
With my 3050 (4GB) I max out at 960x1024 using just [...]. I currently get max 1408x1472 (1472x1472 producing black images) with just [...]. I currently get max 1472x1536 with just [...]. Are you having better speeds with this than --lowvram or --medvram on a Mac? I am using DPM++ 2M SDE Karras, a 1.x model, and xformers compiled by C43H66N12O12S2. |
So I tested this on my 3080 Ti and it seems to work, but oh man, it's slow. And the reason it's slow is that it's only using 20% of my GPU. The images look the same at a normal glance with or without this enabled, so that's good; if it does in fact not change the image (unlike --lowvram), then you'll have a win on that front. But it's still slow, at least on my system. |
I've added
This is all good info, thank you. I still haven't gotten to test much but it does look like I see a big performance hit with |
OK, just to be clear, I did not pair those flags with [...]. I don't know if you mean you're testing [...]. Maybe there is some clear benefit for CPU and Mac users at a specific size. |
I've just made some changes, try
Yes, I did assume you were using |
Testing the new flag, using just [...]. Is it working well with CPU though? |
Make sure to have both |
It does work with [...]. From testing just --medvram and --lowvram individually, it was faster than --lowvram and slower than --medvram.
|
Wow! Great news for Mac users. At that size, did you manage to test --lowvram and --medvram speeds as well (since you fixed the larger-sizes issue)? |
Would this have any benefit for a mobile 4GB GTX 1650 user like me? My ubiquitous card needs --no-half, but it would crash on load with the two sub-quad arguments at min-chunk values ranging from 75 down to 30. Adding --medvram I could do 320x320, but that was the biggest. On current SD I can usually get 576x576 max with fast (~2 sec/it) performance using the command line options --no-half --no-half-vae --medvram --opt-split-attention --xformers. I was hoping this fix might help boost the resolution I can generate in those much quicker modes (compared to args with --lowvram). My card only has CUDA cores, though; is this fix only for cards with tensor cores? |
@ClashSAN @cgessai I've made what is probably a very important modification for low memory usage, so try using sub-quadratic attention again with the latest changes. The latest change is also very relevant to Mac users: now if sub-quadratic attention is enabled then large image generation (1472x1472 or larger) should work! @rworne
Probably yes, especially with the most recent change. Birch-san noted that this 2048x2048 image was generated with 0.08GB chunks instead of requiring 80GB VRAM for the self-attention matmul. You'll still probably want to experiment with the kv chunk size ( |
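For anyone wondering what chunking the self-attention matmul buys you, here is a rough PyTorch sketch of the idea (hypothetical code, not the implementation in this PR): process the key/value sequence in slices so only a seq_len x kv_chunk_size slice of the attention weights is ever materialized, and keep a running log-sum-exp so the result still matches ordinary softmax attention.
import torch

def chunked_attention(q, k, v, kv_chunk_size=1024):
    # q, k, v: (batch, seq_len, dim). Naive softmax(q @ k.T) @ v materializes a
    # (seq_len, seq_len) weight matrix; here only (seq_len, kv_chunk_size) exists
    # at any one time, trading extra passes over k/v for memory.
    scale = q.shape[-1] ** -0.5
    acc = torch.zeros_like(q)                                     # running sum of exp(score) @ v
    denom = q.new_zeros(q.shape[:-1] + (1,))                      # running sum of exp(score)
    running_max = q.new_full(q.shape[:-1] + (1,), float("-inf"))  # running max score (numerical stability)
    for start in range(0, k.shape[1], kv_chunk_size):
        k_chunk = k[:, start:start + kv_chunk_size]
        v_chunk = v[:, start:start + kv_chunk_size]
        scores = q @ k_chunk.transpose(-1, -2) * scale            # (batch, seq_len, chunk)
        chunk_max = scores.amax(dim=-1, keepdim=True)
        new_max = torch.maximum(running_max, chunk_max)
        correction = torch.exp(running_max - new_max)             # rescale old accumulators to the new max
        exp_scores = torch.exp(scores - new_max)
        acc = acc * correction + exp_scores @ v_chunk
        denom = denom * correction + exp_scores.sum(dim=-1, keepdim=True)
        running_max = new_max
    return acc / denom
The attention-weight memory then scales with seq_len x kv_chunk_size rather than seq_len squared, which is why a 2048x2048 generation can run in sub-0.1GB chunks instead of needing tens of GB for the full matrix.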
(branch force-pushed from c1d9190 to 1824569)
@cgessai Just for testing, I tried, on my little laptop RTX 3050 (4GB VRAM), your combination of parameters "--no-half --no-half-vae --medvram --opt-split-attention --xformers" vs. my initial setup, which is only "--xformers" (1.44 s/it with Euler a, txt2img, 512x512). Generation time actually increased to >5 s/it (same generation settings), so I reverted to --xformers only. Have you tried? |
Hi, I think the percent levels are still not working for CUDA. Using [...] |
Edit: Some performance stats. Prompt: [...] Run on a 32GB Mac Studio, base model: [...]. sub-quad-min-chunk-size 30: 7:56. These numbers are really good. I'll probably need to run a size larger than 1024x1024, as with the new optimizations it just seems to comfortably fit within my RAM. This is a 500-600% speed increase over Automatic1111 (for me); the increase was due to the smaller memory footprint keeping the system from swapping. So RAM-bound tasks on a 64GB machine likely won't see any improvement? sub-quad-min-chunk-vram-percent 1: 6:53. Some more quick numbers for 1280x1280: at these resolutions, I'm looking at 60+ minutes before this patch. Original: [...]. I tried it with --sub-quad-min-chunk-size 15, and it also renders 1472x1472 in approx. 30 minutes. Without this patch, I can do 1024x1024 in the same timeframe. What I am seeing is that the memory consumed may not be being returned? Typically when generating an image it grows, but when finished, most of the consumed memory is returned; in the above case, it wasn't. I usually do not see a 100% return, but at least 50% or more. What you see below is what I saw after coming back from shopping; SD finished about 90 minutes earlier. And after killing the process: [...] |
I've removed the configuration options, hopefully they will be unneeded. Now just use
Yes, you were correct, the settings were broken. I've actually removed them to try having the chunk size adjusted in real time based on available VRAM. This seems to work quite well for macOS, but it will be interesting to see how well it works for CUDA. |
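As a rough illustration of the idea (a hypothetical helper, not the code in this PR; the function and parameter names are made up), picking a kv chunk size from currently free memory might look like this, using torch.cuda.mem_get_info on CUDA and psutil elsewhere:
import math
import psutil
import torch

def pick_kv_chunk_size(q_seq_len: int, element_size: int = 2, mem_fraction: float = 0.5) -> int:
    # How much memory is free right now?
    if torch.cuda.is_available():
        free_bytes, _total = torch.cuda.mem_get_info()
    else:
        free_bytes = psutil.virtual_memory().available
    budget = int(free_bytes * mem_fraction)
    # One chunk of attention weights costs roughly q_seq_len * chunk * element_size bytes,
    # so solve for the largest chunk that stays inside the budget.
    chunk = max(1, budget // (q_seq_len * element_size))
    # Round down to a power of two and never exceed the sequence length itself.
    return min(q_seq_len, 2 ** int(math.log2(chunk)))
Recomputing this per generation is what lets the chunk size shrink automatically when the system is already under memory pressure.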
@Ehplodor Are you sure you didn't get mixed up between seconds/it and it/s? It switches their order back and forth depending on which is moving more quickly. I ask because I should not be able to get within ~1/10th of a second of your card. It's anecdotal, but every once in a while I've noticed 'better' 4GB cards than mine reporting seemingly worse performance. @brkirch Second, I'll leave the detailed notes here, but I looked at whether I could obtain any performance boost specifically, as opposed to enhanced resolution, because of an un-noted earlier test where a 1280x1280 crapped out and (presumably) dumped the image to pixels. In summary:
So, for me, I'm not seeing any performance edge yet. I realize that's not exactly what these changes are mainly about, but I had hoped it would be a fringe benefit. |
@rworne I made a change that should get much better performance, but I'll need some testing done to see if it works well on Macs with less memory than mine has (I have 64 GB).
@cgessai If you want to keep it simple, you can clone a second copy of the webui with the changes:
Thanks for the benchmarks, but unfortunately I discovered quite a few things broken that actually prevented the chunk settings from working correctly. Also I saw just recently that Birch-san has mentioned that sub-quadratic attention is comparable to xformers - so if you have xformers working it may still be the better option. That said, the latest changes should have much better performance than before if it works on CUDA. It may run out of memory instead - I'll need someone to tell me if it still works or not. Note that the options |
@brkirch If you can't find a tester, I'll be here for you... just taking my sweet time is all. :) |
(commits added: dabcda4, 4bfa22e, cadcb36, 848605f)
Using commit 4bfa22e, 1024x1024:
Swap grew to 6-7GB, but that was likely stuff other than SD being swapped out. I don't use the Mac when running the speed tests. This is the fastest I have ever gotten an image of this size to draw. For 1152x1152, it slowed down considerably:
For 1280x1280, the performance is much worse. Back to pre-patch performance and a 20GB swap file:
So with this patch, I can do 1024x1024 as a max size. Previously 512x768 was roughly the max of what I could do without swapping. After running these three images, the Python process is 34GB at the moment with a 16.5GB swap file. Usually it gives some memory back after processing images, but not anymore. Killing the process gives everything back, though. |
(branch force-pushed from ffb424b to cadcb36)
Latest results, pretty much the same. From earlier results I know that 1280x1280 can be nearly halved from where it is now, but this is very good progress. 1024x1024
1152x1152
1280x1280
I'm pretty sure running processes on the Mac are affecting these numeric scores, so I did an experiment to check...
|
I've got something more for Mac users to test (@rworne). This fork of PyTorch gets ~25% better performance for GPU acceleration on macOS, and memory usage is much lower too. To install it, make sure Xcode is up to date, update with the latest changes in this PR, and run [...] with the script below:
#!/bin/bash
#########################################################
# Uncomment and change the variables below to your need:#
#########################################################
# Install directory without trailing slash
#install_dir="/home/$(whoami)"
# Name of the subdirectory
#clone_dir="stable-diffusion-webui"
# Commandline arguments for webui.py, for example: export COMMANDLINE_ARGS="--medvram --opt-split-attention"
export COMMANDLINE_ARGS="$COMMANDLINE_ARGS --opt-sub-quad-attention"
# python3 executable
#python_cmd="python3"
# git executable
#export GIT="git"
# python3 venv without trailing slash (defaults to ${install_dir}/${clone_dir}/venv)
venv_dir="venv-torch-2.0-alpha"
# script to launch to start the app
#export LAUNCH_SCRIPT="launch.py"
# install command for torch
export USE_DISTRIBUTED=1
export TORCH_COMMAND="pip install git+https://github.com/kulinseth/pytorch@d5e2e4e14ba4e0ceeef84bab3a486b189050669e#egg=torch --pre torchvision==0.15.0.dev20230103 -f https://download.pytorch.org/whl/nightly/cpu/torch_nightly.html"
# Requirements file to use for stable-diffusion-webui
#export REQS_FILE="requirements_versions.txt"
# Fixed git repos
#export K_DIFFUSION_PACKAGE=""
#export GFPGAN_PACKAGE=""
# Fixed git commits
#export STABLE_DIFFUSION_COMMIT_HASH=""
#export TAMING_TRANSFORMERS_COMMIT_HASH=""
#export CODEFORMER_COMMIT_HASH=""
#export BLIP_COMMIT_HASH=""
# Uncomment to enable accelerated launch
#export ACCELERATE="True"
###########################################
The first time you run [...]. From the testing I did, I noticed training embeddings and hypernetworks was broken, but as far as I could tell all other features still seemed to function correctly. If any macOS users who try this could let me know whether it helps with performance and memory usage when using sub-quadratic attention, it would be much appreciated. |
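As a usage sketch (my reading of the steps above; webui-user.sh and webui.sh are the standard launch files in this repo), the workflow would be to save the script above as webui-user.sh in the repository root and launch as usual; the separate venv_dir keeps the PyTorch fork isolated from an existing install.
# assuming the script above has been saved as webui-user.sh in the repo root
./webui.sh   # on first launch this should create venv-torch-2.0-alpha and install the forked torch via TORCH_COMMAND
Restoring the original webui-user.sh (and the original venv_dir) should switch back to the stock PyTorch setup.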
(branch force-pushed from b8d8b6d to 7020a68)
EDIT: Ah, I figured out how the new hiresfix works. To do it the old way: set your image size to 512x512, then use Upscale by as a multiplication factor. The crash was occurring because 2048x2048 is too big for my setup. So 512x512 with Upscale by 2 gives a 1024x1024 image: This is half the time of the optimized version! Incredible! But the upscaler isn't the same. However, changing the upscaler to ESRGAN_4x gives similar results and takes a couple more seconds to finish:
ORIGINAL:
Loaded it up. It's faster: for 512x512:
A modest improvement. But it has issues:
For example: with hiresfix checked and Upscale by at the default of 2, it crashes 10% of the way through the drawing:
With upscale set to 1, it doesn't crash:
I think there's no significant change in the end. If the first pass were faster, then it'd shave off quite a bit of time. The memory usage was very smooth, with no heavy spikes - Python was 18GB on average and never went above 20. This is weird - with hiresfix off, it seems to generate usable images in one pass. Here are two runs:
Prompt: [...] and this one: [...] So with hiresfix off, it behaves exactly like the old version; with it on, the image generated is nothing like the original it is trying to enlarge (neither is like the original 512x512 image). |
(branch force-pushed from 3c78db7 to b119815)
It looks like this code https://github.com/AminRezaei0x443/memory-efficient-attention/tree/1bc0d9e6ac5f82ea43a375135c4e1d3896ee1694 is a pip library; is it possible to just use that without copying code? If not, can we copy code from that unlicensed repo? |
We do need the modifications, both for performance (Birch-san noted a 2.78x speedup) and Mac support. We are probably a lot better off for now just getting permission to use this implementation. If the package is later updated with similar performance and Mac support then we can use that instead. |
(branch force-pushed from 354d626 to 848605f)
I got excited about the 2.78x speedup, but I guess you mean for macOS. I did make some pictures with this; my test was four batches of 768x768 pics on an RTX 3090. xformers does it in 20.5s, Doggettx's is 24.7s, and this PR's is 31.33s. I get the same numbers if I try again, so no speedup on Nvidia. I ran it with just --opt-sub-quad-attention. |
I forgot to add the new command line option when fixing the git issue. Using the same prompt & settings: upscaler = Latent, Upscale by = 1. Prompt: a girl holding an umbrella in the rain
With Upscale by = 2:
I can't say it's faster. I also see there may be command line options to tune it. I'll be working with it over the weekend to see what I can do with it. |
The MIT license now applies to the Memory Efficient Attention package, and Birch-san has given permission for releasing their modified implementation under a permissive license as well. I've added the license to this PR. As far as licensing goes, this PR should be ready to merge.
I meant compared to the unmodified Memory Efficient Attention package. In my testing, sub-quadratic attention is slightly slower than the InvokeAI cross attention optimization, but with much better memory management. Birch-san mentioned sub-quadratic attention is essentially the same as xformers, but xformers is optimized for CUDA so it isn't too surprising it is faster. Sub-quadratic attention seems to be best for Mac users and users without CUDA (or with CUDA that can't get xformers working). On Macs it wasn't possible before to generate images of 1472x1472 resolution or higher (regardless of available memory, the MPS backend would crash) but with sub-quadratic attention all resolutions under 2048x2048 work correctly. |
I'll also add that sub-quadratic attention makes it possible for me, on an AMD GPU (gfx1031 - not supported by ROCm atm), to use --no-half and --precision full :) |
Will someone write about this optimization for the wiki page? https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Optimizations |
I tried it, but the [...]. I also tried the other sub-quad parameters, but I couldn't get higher than 576x576. |
Try
|
Same thing, it doesn't get a higher size than [...] |
I'm in the same boat on a 12GB 3060; there is little improvement compared to the standard optimization. Under the cross-attention optimization I can get 2688x1792; with this, it goes up to 2720x1816, but only if I reduce the steps to five. I've noticed that some recent commit has really upped VRAM usage, as I have run out of VRAM a couple of times, which never happened before (and this was also before this commit), so maybe there's a VRAM leak somewhere. |
Can --xformers and --opt-sub-quad-attention run at the same time? |
No, but if you have xformers working then you probably don't want to use sub-quadratic attention. Performance will be worse and it likely won't make a huge difference in the size of images you can generate. |
Thank you for the [...]. It's on par with using [...]. I don't have time to play around much atm to test further, but so far the experience is great. Some tasks that are usually guaranteed to OOM with precision full, like for instance Aesthetic Gradients CLIP, were running just fine with [...] |
@mclsugi If you have a sec, what do your full command line args look like? Also, what's your seconds per iteration (sec/it) for your choice of sampler at 512x512 for 10-20 steps? (Please name the sampler and number of steps, obviously.) Thanks! EDIT: Is --upcast-sampling only for Mac? |
How did you manage to get upcast working with a CUDA card? My GTX 1650 doesn't seem to work with it. |
Sorry for the super late response; I hadn't touched SD since exactly that day until now (fresh install), as I had real-life situations. I was using either Manjaro (Linux) or Windows 10. With this upcast, or plain xformers + precision full + no half, I got around 2.7-2.8 it/s for a single batch, and way faster for a batch of 8 (7 sec for 1 image, 32-34 sec for 8 images). I did do a lot of modifications / playing around which sadly I didn't document.
The 16** series has its own set of problems for this kind of job, if I'm not mistaken. |
I already figured it out. I had to make some changes in the code, but it works now. Thanks for the response, though. |
This is a highly memory efficient cross attention optimization that enables high resolution image generation with much less VRAM.
For more details see:
https://twitter.com/Birchlabs/status/1607503573906063362
Birch-san/diffusers#1
Edit: MIT license added. As far as licensing goes, this PR should be ready to merge.
More detail regarding the changes in this PR:
- README.md [...]
- Adds the --opt-sub-quad-attention option for enabling the sub-quadratic attention optimization. Also adds --sub-quad-q-chunk-size, --sub-quad-kv-chunk-size, and --sub-quad-chunk-threshold options for fine-tuning for memory usage or optimal performance.
- psutil has been changed to install for all platforms in requirements.txt. This is to allow sub-quadratic attention to set the chunk size based on available memory. All the additional checks for the InvokeAI cross attention optimization regarding psutil have been removed.
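As a configuration sketch (the flag names come from this PR; the chunk values below are placeholders for illustration, not recommendations), enabling the optimization with tuning options in webui-user.sh might look like:
# webui-user.sh excerpt: enable sub-quadratic attention, with optional tuning knobs
export COMMANDLINE_ARGS="--opt-sub-quad-attention --sub-quad-q-chunk-size 1024 --sub-quad-kv-chunk-size 512 --sub-quad-chunk-threshold 80"
With no tuning flags, --opt-sub-quad-attention on its own should fall back to the memory-based defaults described above.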