Ring allreduce in multitensor #3000
Conversation
uuuvn commented on Jan 4, 2024 (edited)
- Basic ring_allreduce impl (see the sketch after this list)
- Fix last failing test
- Fuzz test it against naive implementation
- Make code look good
- Benchmark to make sure that it's actually faster
- Hopefully merge
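For readers new to the algorithm, here is a minimal NumPy sketch of the ring allreduce idea: split each buffer into N chunks, reduce-scatter them around the ring, then allgather the reduced chunks. This is a host-side simulation for illustration only, not the PR's LazyBuffer implementation.

```python
import numpy as np

def ring_allreduce(bufs: list[np.ndarray]) -> list[np.ndarray]:
  # bufs holds one array per "device"; plain assignments stand in for
  # device-to-device transfers
  n = len(bufs)
  chunks = [np.array_split(b, n) for b in bufs]
  # reduce-scatter: after n-1 steps, device d holds the full sum of chunk (d+1) % n
  for step in range(n - 1):
    for d in range(n):
      src = (d - step) % n      # chunk index device d sends this step
      nxt = (d + 1) % n         # next neighbor in the ring
      chunks[nxt][src] = chunks[nxt][src] + chunks[d][src]
  # allgather: circulate each fully reduced chunk once around the ring
  for step in range(n - 1):
    for d in range(n):
      src = (d + 1 - step) % n  # fully reduced chunk device d forwards
      chunks[(d + 1) % n][src] = chunks[d][src]
  return [np.concatenate(c) for c in chunks]

if __name__ == "__main__":
  bufs = [np.random.rand(1000).astype(np.float32) for _ in range(4)]
  out = ring_allreduce(bufs)
  assert all(np.allclose(o, sum(bufs)) for o in out)
```

Each device sends 2*(n-1) chunks of roughly 1/n of the buffer, so per-device traffic approaches 2x the buffer size regardless of device count, versus (n-1)x the full buffer for the naive copy-everywhere approach.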
The last failing test is …
Very strange: when I called the function with the same LazyBuffers, but created from numpy to reproduce and debug, everything is OK.
So I was referring to …
The problem also disappears if I use the clang or llvm backend instead of metal...
Force-pushed from cd8e4f6 to d47ec27
So... The only thing left is to make sure that it is actually faster, which at least on my M1 machine isn't the case. Can somebody run …
How is the benchmarking coming?
Or is the benchmark something other than just a Python script that outputs results?
Ahh, running on tiny9
Does look faster. What are the units for this? And why so many floating-point digits!
Hmm, ran a second time, slower
Average: 0.6203446388244629 naive vs 1.1979577302932738 ring. It's actually slower...
The first one is the old impl, the second one is the ring one...
So the first attempt was slower and the second was faster. Can you run it a couple more times?
It looks like only the first run was slow; after that, ring was faster.
That's why it was slower on average on the first run, despite being faster after that.
I'll just fix the test to discard the first run
@geohot made a fix to discard results from the first iteration, which is for some reason much slower. Can you test it again?
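A minimal sketch of that warm-up-discard pattern, for reference (illustrative only, not the PR's actual benchmark code):

```python
import time

def bench(fn, iters=10, warmup=1):
  # the first call typically pays one-time costs (kernel compilation,
  # allocation), so run it untimed and average only the remaining iterations
  for _ in range(warmup):
    fn()
  times = []
  for _ in range(iters):
    st = time.perf_counter()
    fn()
    times.append(time.perf_counter() - st)
  return sum(times) / len(times)
```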
Your script isn't printing GB/s. What GB/s do you expect? You can see the tinybox bandwidths between GPUs in "AMD Benchmark" CI
I've rewritten the benchmark to output Gb/s and GFLOP/s.
Also renamed the benchmark script to …
Gb/s or GB/s? What should it be in theory?
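For what it's worth, in a ring allreduce each device sends and receives 2*(n-1) chunks of numel/n elements (n-1 during reduce-scatter, n-1 during allgather), so a bandwidth figure might be computed like this (an illustrative sketch, not the actual script's code; GB/s counts bytes, Gb/s counts bits, an 8x difference):

```python
def allreduce_gb_per_s(n_devices: int, numel: int, seconds: float, itemsize: int = 4) -> float:
  # per-device traffic: 2*(n-1)/n of the buffer, approaching 2x the buffer
  # size as the device count grows
  bytes_per_device = 2 * (n_devices - 1) / n_devices * numel * itemsize
  return bytes_per_device / seconds / 1e9  # GB/s (bytes); multiply by 8 for Gb/s
```

In theory the result should approach the link bandwidth between adjacent GPUs, since each ring step keeps every link busy in one direction.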
Can confirm it improves the speed of cifar a little.
Profiled llama; it seems the delay between packets in SDMA queues is slightly higher than in AQL. I am going to measure the performance impact of moving some copies to AQL queues.
39 ms down from 47 ms on hlb_cifar with ring
LLaMA slowness is fixed. Ready for review, I think. But #3560 is required for the benchmark on HSA not to crash, and there is an over-6500-lines issue.
#3560 is merged. Is this ready? I see cifar is slightly faster, llama is the same.
It is ready except for this hsa-transfer thing (the naive allreduce in my benchmark triggers an HSA bug). Looks like there is something wrong with #3560: it fixed this before, but it looks like @nimlgen changed it and now there is something else wrong with JIT transfers...
# naive allreduce: copy every device's buffer to every other device, then reduce elementwise
return [functools.reduce(lambda x,y: x.e(bop, y), [x.copy_to_device(lb.device) for x in lbs]) for lb in lbs]
n_lbs, dim = len(lbs), prod(lbs[0].shape)
# Ring allreduce doesn't provide a benefit with only 2 nodes or where the number of elements is less than 256k (empirically)
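In context, that check presumably gates the dispatch roughly like this (a hedged reconstruction; `naive_allreduce` and `ring_allreduce` are stand-in names, not the PR's actual function names):

```python
from tinygrad.helpers import prod

def allreduce(bop, lbs):
  n_lbs, dim = len(lbs), prod(lbs[0].shape)
  # with 2 devices the ring degenerates to a single exchange, and below
  # ~256k elements the per-step overhead outweighs the bandwidth savings
  if n_lbs <= 2 or dim < 256_000:
    return naive_allreduce(bop, lbs)
  return ring_allreduce(bop, lbs)
```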
Skipping 2 nodes makes sense. Can you show how you derived the 256k number?
Also, what is it expected to hit in the CI benchmark? I see an improvement in test/external/external_benchmark_multitensor_allreduce.py and cifar training. Does it hit llama or resnet?
> Can you show how you derived the 256k number?

By running it with different sizes and seeing what's faster (see the sketch below).

> Also, what is it expected to hit in the CI benchmark? I see an improvement in test/external/external_benchmark_multitensor_allreduce.py and cifar training.

Should be +-12GB/s in the benchmark.

> Does it hit llama or resnet?

LLaMA no (small allreduce sizes), resnet idk.
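The sweep mentioned above might look roughly like this (a hypothetical sketch: `make_bufs`, `naive_allreduce`, and `ring_allreduce` are stand-in helpers, and `bench` is the warm-up-discarding timer sketched earlier; the real measurements used test/external/external_benchmark_multitensor_allreduce.py):

```python
# sweep element counts to locate where ring starts beating naive
for numel in [2**k for k in range(14, 24)]:
  bufs = make_bufs(numel)  # hypothetical: one buffer per device
  t_naive = bench(lambda: naive_allreduce(bop, bufs))
  t_ring = bench(lambda: ring_allreduce(bop, bufs))
  winner = "ring" if t_ring < t_naive else "naive"
  print(f"{numel:>9} elements: naive {t_naive*1e3:7.2f} ms, ring {t_ring*1e3:7.2f} ms -> {winner}")
```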
This branch is currently behind tinygrad/master. The line count difference bot is disabled.