
Ring allreduce in multitensor #3000

Merged — 26 commits merged into tinygrad:master from ring_allreduce on Mar 20, 2024

Conversation

@uuuvn (Contributor) commented Jan 4, 2024

  • Basic ring_allreduce impl
  • Fix last failing test
  • Fuzz test it against naive implementation
  • Make code look good
  • Benchmark to make sure that it's actually faster
  • Hopefully merge
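
For context, here is a minimal NumPy sketch of the ring allreduce idea (a reduce-scatter pass followed by an allgather pass). It illustrates the general algorithm only, not the multitensor code in this PR:

```python
# Toy ring allreduce over plain numpy arrays: each "device" splits its buffer
# into n chunks, the chunks are summed around the ring (reduce-scatter), then
# the fully reduced chunks are circulated to everyone (allgather).
import numpy as np

def ring_allreduce(buffers):
  n = len(buffers)
  chunks = [np.array_split(b.astype(np.float32), n) for b in buffers]
  # Reduce-scatter: after n-1 steps, device i owns the full sum of chunk (i+1) % n.
  for step in range(n - 1):
    for i in range(n):
      c = (i - step) % n
      chunks[(i + 1) % n][c] += chunks[i][c]
  # Allgather: pass the reduced chunks around so every device ends up with all of them.
  for step in range(n - 1):
    for i in range(n):
      c = (i + 1 - step) % n
      chunks[(i + 1) % n][c] = chunks[i][c]
  return [np.concatenate(c) for c in chunks]

bufs = [np.arange(8, dtype=np.float32) + i for i in range(4)]
assert all(np.allclose(out, sum(bufs)) for out in ring_allreduce(bufs))
```

Each device moves roughly 2·(n−1)/n of its buffer instead of broadcasting the whole thing to everyone, which is why this should win for large tensors.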

@uuuvn marked this pull request as draft January 4, 2024 01:10
@uuuvn (Contributor, Author) commented Jan 4, 2024

The last failing test is def test_simple_reduce_1(self): return self._test_simple_reduce_axis(1). I suspect it fails because something is wrong with the way I chunk LazyBuffers.

@uuuvn (Contributor, Author) commented Jan 4, 2024

Very strange: when I call the function with the same LazyBuffers, but created from numpy to reproduce and debug, everything is fine.
The problem also disappears when I add lbs = [lb.contiguous() for lb in lbs] at the start of the function to force-realize the LazyBuffers.
Some bug in lazy/jit/whatever? Maybe I'm doing something that somehow breaks the ShapeTracker?

@g1y5x3 (Contributor) commented Jan 4, 2024

So I was referring to chunk in tensor.py, but now the question is where the appropriate place for chunk is. Also, you can refer to extra/dist/collectives.py for making it multiprocess. Once I have the multi-GPU training benchmarks, it will be easier to test the implementation.
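
For reference, a quick sketch of the kind of per-device chunking being discussed, using tinygrad's torch-style Tensor.chunk (signature from memory; the PR itself operates on LazyBuffers rather than Tensors):

```python
from tinygrad.tensor import Tensor

# Split a flat tensor into one shard per device, the way a ring allreduce would.
t = Tensor.arange(1024)
n_devices = 4
shards = t.chunk(n_devices)      # list of 4 tensors, 256 elements each
assert len(shards) == n_devices
```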

@uuuvn (Contributor, Author) commented Jan 4, 2024

Very strange: when I call the function with the same LazyBuffers, but created from numpy to reproduce and debug, everything is fine. The problem also disappears when I add lbs = [lb.contiguous() for lb in lbs] at the start of the function to force-realize the LazyBuffers. Some bug in lazy/jit/whatever? Maybe I'm doing something that somehow breaks the ShapeTracker?

The problem also disappears if I use the clang or llvm backend instead of metal...

@uuuvn marked this pull request as ready for review January 4, 2024 06:40
@geohot added the "bounty locked" label (Bounty is locked to someone) Jan 4, 2024
@uuuvn marked this pull request as draft January 5, 2024 02:59
@uuuvn force-pushed the ring_allreduce branch 4 times, most recently from cd8e4f6 to d47ec27, January 9, 2024 01:53
@uuuvn marked this pull request as ready for review January 9, 2024 09:57
@uuuvn (Contributor, Author) commented Jan 9, 2024

So... the only thing left is to make sure that it is actually faster, which, at least on my M1 machine, isn't the case.

Can somebody run python test/test_multitensor_speed.py on a 4+ GPU machine?

@geohot (Collaborator) commented Jan 9, 2024

How is the benchmarking coming?

@uuuvn (Contributor, Author) commented Jan 10, 2024

How is the benchmarking coming?

Can somebody run python test/test_multitensor_speed.py on a 4+ GPU machine?

Or is the benchmark something other than just a Python script that outputs results?

@geohot (Collaborator) commented Jan 10, 2024

Ahh, running on tiny9

tiny@tiny9:~/tinygrad$ python test/test_multitensor_speed.py
Iter (r) 0/10: 8.868210792541504
Iter (n) 0/10: 1.705146074295044
Iter (r) 1/10: 0.35244202613830566
Iter (n) 1/10: 0.49947524070739746
Iter (r) 2/10: 0.3451864719390869
Iter (n) 2/10: 0.4998817443847656
Iter (r) 3/10: 0.3451802730560303
Iter (n) 3/10: 0.49986791610717773
Iter (r) 4/10: 0.3446967601776123
Iter (n) 4/10: 0.5003104209899902
Iter (r) 5/10: 0.34472179412841797
Iter (n) 5/10: 0.5005695819854736
Iter (r) 6/10: 0.3451967239379883
Iter (n) 6/10: 0.4998791217803955
Iter (r) 7/10: 0.34467482566833496
Iter (n) 7/10: 0.49942994117736816
Iter (r) 8/10: 0.3446054458618164
Iter (n) 8/10: 0.4994697570800781
Iter (r) 9/10: 0.3446621894836426
Iter (n) 9/10: 0.4994165897369385
Average: 0.6203446388244629 naive vs 1.1979577302932738 ring

Does look faster. What are the units for this? And why so many floating-point digits!

@geohot (Collaborator) commented Jan 10, 2024

Hmm, ran a second time, slower

tiny@tiny9:~/tinygrad$ python test/test_multitensor_speed.py
Iter (r) 0/10: 0.6794774532318115
Iter (n) 0/10: 0.5218009948730469
Iter (r) 1/10: 0.37977075576782227
Iter (n) 1/10: 0.536362886428833
Iter (r) 2/10: 0.3786904811859131
Iter (n) 2/10: 1.2099571228027344
Iter (r) 3/10: 0.3500063419342041
Iter (n) 3/10: 0.49944448471069336
Iter (r) 4/10: 0.3451883792877197
Iter (n) 4/10: 0.49990272521972656
Iter (r) 5/10: 0.3456103801727295
Iter (n) 5/10: 0.5003464221954346
Iter (r) 6/10: 0.34472084045410156
Iter (n) 6/10: 0.49947381019592285
Iter (r) 7/10: 0.3455815315246582
Iter (n) 7/10: 0.4994392395019531
Iter (r) 8/10: 0.3451573848724365
Iter (n) 8/10: 0.49997830390930176
Iter (r) 9/10: 0.3456544876098633
Iter (n) 9/10: 0.49947142601013184
Average: 0.5766177415847777 naive vs 0.38598580360412593 ring

@uuuvn (Contributor, Author) commented Jan 10, 2024

Average: 0.6203446388244629 naive vs 1.1979577302932738 ring

It's actually slower...
0.6 seconds for naive and 1.19 for ring.

@uuuvn (Contributor, Author) commented Jan 10, 2024

The first one is the old impl, the second one is the ring one...

@uuuvn (Contributor, Author) commented Jan 10, 2024

So the first attempt was slower and the second was faster. Can you run it a couple more times?

@uuuvn (Contributor, Author) commented Jan 10, 2024

tiny@tiny9:~/tinygrad$ python test/test_multitensor_speed.py
Iter (r) 0/10: 8.868210792541504
Iter (n) 0/10: 1.705146074295044
Iter (r) 1/10: 0.35244202613830566
Iter (n) 1/10: 0.49947524070739746
Iter (r) 2/10: 0.3451864719390869
Iter (n) 2/10: 0.4998817443847656

It looks like only the first iteration was slow; after that, ring was faster.

@uuuvn (Contributor, Author) commented Jan 10, 2024

Iter (r) 0/10: 8.868210792541504
Iter (n) 0/10: 1.705146074295044

That's why on the first run it was slower on average, despite being faster after that:

Iter (r) 1/10: 0.37977075576782227
Iter (n) 1/10: 0.536362886428833
Iter (r) 2/10: 0.3786904811859131
Iter (n) 2/10: 1.2099571228027344
Iter (r) 3/10: 0.3500063419342041
Iter (n) 3/10: 0.49944448471069336
Iter (r) 4/10: 0.3451883792877197
Iter (n) 4/10: 0.49990272521972656
Iter (r) 5/10: 0.3456103801727295
Iter (n) 5/10: 0.5003464221954346
Iter (r) 6/10: 0.34472084045410156
Iter (n) 6/10: 0.49947381019592285
Iter (r) 7/10: 0.3455815315246582
Iter (n) 7/10: 0.4994392395019531
Iter (r) 8/10: 0.3451573848724365
Iter (n) 8/10: 0.49997830390930176
Iter (r) 9/10: 0.3456544876098633
Iter (n) 9/10: 0.49947142601013184

I'll just fix the test to discard the first run
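
A minimal sketch of that fix, discarding the warm-up iteration before averaging (illustrative helper, not the actual test code):

```python
import time

def bench(fn, iters=10):
  times = []
  for _ in range(iters):
    st = time.perf_counter()
    fn()
    times.append(time.perf_counter() - st)
  # Drop iteration 0: it includes kernel compilation / JIT warm-up and skews the average.
  return sum(times[1:]) / (len(times) - 1)
```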

@uuuvn (Contributor, Author) commented Jan 10, 2024

@geohot I made a fix to discard results from the first iteration, which is much slower for some reason.

Can you test it again?

@geohot (Collaborator) commented Jan 10, 2024

Your script isn't printing GB/s. What GB/s do you expect?

You can see the tinybox bandwidths between GPUs in "AMD Benchmark" CI

@uuuvn (Contributor, Author) commented Jan 10, 2024

Your script isn't printing GB/s. What GB/s do you expect?

You can see the tinybox bandwidths between GPUs in "AMD Benchmark" CI

I've rewritten the benchmark to output Gb/s and GFLOP/s.
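
Roughly, the numbers can be derived like this for an allreduce of N float32 elements across D devices; the 2·(D−1)/D traffic factor is the standard ring allreduce accounting, and the exact bookkeeping in the script may differ:

```python
def allreduce_stats(n_elems: int, n_devices: int, secs: float):
  # Per-device traffic for ring allreduce: 2*(D-1)/D of the buffer
  # (reduce-scatter + allgather), float32 = 4 bytes per element.
  bytes_per_device = 2 * (n_devices - 1) / n_devices * n_elems * 4
  # One common FLOP count for a sum-allreduce: (D-1) adds per reduced element.
  flops = (n_devices - 1) * n_elems
  return flops / secs / 1e9, bytes_per_device * 8 / secs / 1e9  # GFLOP/s, Gb/s

# Placeholder values, just to show the calculation.
gflops, gbits_per_s = allreduce_stats(n_elems=64 * 1024 * 1024, n_devices=6, secs=0.45)
```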

@uuuvn (Contributor, Author) commented Jan 10, 2024

Also renamed the benchmark script to test/bench_multitensor_speed.py; otherwise it gets run by CI.

@geohot (Collaborator) commented Jan 15, 2024

tiny@tiny9:~/tinygrad$ python3 test/bench_multitensor_speed.py
naive_allreduce iter 0/10: 0.71 sec 9.07 GFLOP/s 108.85 Gb/s
naive_allreduce iter 1/10: 0.65 sec 9.88 GFLOP/s 118.53 Gb/s
naive_allreduce iter 2/10: 0.65 sec 9.89 GFLOP/s 118.71 Gb/s
naive_allreduce iter 3/10: 0.65 sec 9.91 GFLOP/s 118.88 Gb/s
naive_allreduce iter 4/10: 0.65 sec 9.91 GFLOP/s 118.89 Gb/s
naive_allreduce iter 5/10: 0.65 sec 9.89 GFLOP/s 118.69 Gb/s
naive_allreduce iter 6/10: 0.65 sec 9.91 GFLOP/s 118.89 Gb/s
naive_allreduce iter 7/10: 0.65 sec 9.89 GFLOP/s 118.71 Gb/s
naive_allreduce iter 8/10: 0.65 sec 9.89 GFLOP/s 118.71 Gb/s
naive_allreduce iter 9/10: 0.65 sec 9.89 GFLOP/s 118.68 Gb/s
ring_allreduce iter 0/10: 0.45 sec 116.63 GFLOP/s 242.78 Gb/s
ring_allreduce iter 1/10: 0.45 sec 117.74 GFLOP/s 245.10 Gb/s
ring_allreduce iter 2/10: 0.44 sec 118.51 GFLOP/s 246.69 Gb/s
ring_allreduce iter 3/10: 0.44 sec 118.50 GFLOP/s 246.67 Gb/s
ring_allreduce iter 4/10: 0.44 sec 118.38 GFLOP/s 246.42 Gb/s
ring_allreduce iter 5/10: 0.44 sec 118.44 GFLOP/s 246.55 Gb/s
ring_allreduce iter 6/10: 0.44 sec 118.39 GFLOP/s 246.45 Gb/s
ring_allreduce iter 7/10: 0.44 sec 118.46 GFLOP/s 246.60 Gb/s
ring_allreduce iter 8/10: 0.44 sec 118.47 GFLOP/s 246.61 Gb/s
ring_allreduce iter 9/10: 0.44 sec 118.33 GFLOP/s 246.32 Gb/s
Naive:
  0.66 seconds/iter
  9.81 GFLOP/s
  117.75 Gb/s
Ring:
  0.45 seconds/iter
  118.18 GFLOP/s
  246.02 Gb/s

Gb/s or GB/s? What should it be in theory?
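
As a rough answer to the theory question: with D devices and S bytes per device, a ring allreduce moves about 2·(D−1)/D·S bytes over each link, so the achievable rate is bounded by the per-link bandwidth (mind the bits-versus-bytes factor of 8 when comparing Gb/s to GB/s). A back-of-the-envelope check, with a placeholder link bandwidth rather than the tinybox's real number:

```python
# Lower bound on ring allreduce time if the ring link is the bottleneck.
# link_GBps is a placeholder; the real GPU-to-GPU numbers are in the "AMD Benchmark" CI.
def ring_allreduce_time_bound(buffer_bytes: int, n_devices: int, link_GBps: float) -> float:
  per_link_bytes = 2 * (n_devices - 1) / n_devices * buffer_bytes
  return per_link_bytes / (link_GBps * 1e9)   # seconds, ignoring latency and the adds

print(ring_allreduce_time_bound(buffer_bytes=1 << 30, n_devices=6, link_GBps=25.0))
```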

@chenyuxyz added this to the "llama2 70B FP16 10 tok/s" milestone Mar 10, 2024
@chenyuxyz (Collaborator) commented
@nimlgen can you review this one and see how much it overlaps with #3626, and which approach (or both) we will keep long term?

@geohot (Collaborator) commented Mar 10, 2024

Can confirm it improves the speed of cifar a little:

tiny@tiny12:~/tinygrad$ HSA=1 HALF=1 STEPS=350 BS=1536 GPUS=6 TARGET_EVAL_ACC_PCT=93 python3 examples/hlb_cifar10.py
shuffling training dataset in 958.98 ms (epoch=0)
  0 18045.31 ms run, 18039.51 ms python,    5.80 ms HSA * 6, 1197.00 loss, 0.000043 LR, 0.56 GB used,    111.74 GFLOPS,   2016.42 GOPS
  1 1139.61 ms run, 1136.97 ms python,    2.65 ms HSA * 6, 1195.00 loss, 0.000085 LR, 6.63 GB used,   1767.76 GFLOPS,   2014.56 GOPS
  2   53.81 ms run,   13.86 ms python,   39.96 ms HSA * 6, 1178.00 loss, 0.000128 LR, 6.64 GB used,  37436.16 GFLOPS,   2014.56 GOPS
  3   45.62 ms run,   12.35 ms python,   33.27 ms HSA * 6, 1163.00 loss, 0.000171 LR, 6.64 GB used,  44158.40 GFLOPS,   2014.56 GOPS
  4   46.78 ms run,   13.41 ms python,   33.37 ms HSA * 6, 1161.00 loss, 0.000214 LR, 6.64 GB used,  43068.90 GFLOPS,   2014.56 GOPS
  5   45.42 ms run,   12.30 ms python,   33.11 ms HSA * 6, 1156.00 loss, 0.000256 LR, 6.64 GB used,  44355.12 GFLOPS,   2014.56 GOPS
  6   45.32 ms run,   12.29 ms python,   33.03 ms HSA * 6, 1138.00 loss, 0.000299 LR, 6.64 GB used,  44451.15 GFLOPS,   2014.56 GOPS

Review thread on tinygrad/realize.py (outdated, resolved).
@nimlgen (Collaborator) commented Mar 14, 2024

Profiled llama; it seems that the delay between packets in SDMA queues is slightly higher than in AQL. I am going to measure the performance impact of moving some copies to AQL queues.

@uuuvn (Contributor, Author) commented Mar 16, 2024

[image attachment]

@uuuvn (Contributor, Author) commented Mar 16, 2024

39 ms down from 47 ms on hlb_cifar with ring

@uuuvn (Contributor, Author) commented Mar 16, 2024

LLaMA slowness is fixed. Ready for review, I think.

But #3560 is required for the benchmark on HSA not to crash, and there is an over-6500-line issue.

@chenyuxyz (Collaborator) commented
#3560 is merged. Is this ready? I see cifar is slightly faster; llama is the same.

python test/external/external_benchmark_multitensor_allreduce.py

...
ring_allreduce iter 9/10: 0.085602 sec 102.22 GFLOP/s 11.68 GB/s
ring_allreduce iter 10/10: 0.085636 sec 102.18 GFLOP/s 11.68 GB/s
Traceback (most recent call last):
  File "/home/chenyu/tinygrad/test/external/external_benchmark_multitensor_allreduce.py", line 63, in <module>
    main()
  File "/home/chenyu/tinygrad/test/external/external_benchmark_multitensor_allreduce.py", line 58, in main
    (naive_gflops, naive_gbs, naive_secs) = test(naive_allreduce, devs, N)
  File "/home/chenyu/tinygrad/test/external/external_benchmark_multitensor_allreduce.py", line 32, in test
    realize(_jitted(impl, ReduceOps.SUM, Tensor(MultiLazyBuffer(lbs, 0), device=devs)).lazydata.lbs)
  File "/home/chenyu/tinygrad/tinygrad/features/jit.py", line 115, in __call__
    for ji in self.jit_cache: ji.prg(cast(List[Buffer], ji.rawbufs), var_vals, wait=DEBUG>=2, jit=True)
  File "/home/chenyu/tinygrad/tinygrad/runtime/graph/hsa.py", line 130, in __call__
    elif i == 1: self.transfers[self.ji_to_transfer[j]][2] = input_rawbuffers[input_idx]._buf # src
IndexError: list index out of range

@uuuvn (Contributor, Author) commented Mar 18, 2024

#3560 is merged. Is this ready? I see cifar is slightly faster; llama is the same.

[benchmark command and traceback quoted from the previous comment]

It is ready, except for this HSA transfer thing (the naive allreduce in my benchmark triggers an HSA bug).

Looks like there is something wrong with #3560: it fixed this before, but it seems @nimlgen changed it and now there is something else wrong with JIT transfers...

Diff under review:

return [functools.reduce(lambda x,y: x.e(bop, y), [x.copy_to_device(lb.device) for x in lbs]) for lb in lbs]

n_lbs, dim = len(lbs), prod(lbs[0].shape)
# Ring allreduce doesn't provide a benefit with only 2 nodes or where number of elements is less than 256k (empirically)
Review comment (Collaborator):

Skipping 2 nodes makes sense. Can you show how you derived the 256k number?

Also, what is this expected to hit in the CI benchmark? I see an improvement in test/external/external_benchmark_multitensor_allreduce.py and in cifar training. Does it hit llama or resnet?

Reply — @uuuvn (Contributor, Author):

Can you show how you derived the 256k number?

By running it with different sizes and seeing what's faster.

Also, what is this expected to hit in the CI benchmark? I see an improvement in test/external/external_benchmark_multitensor_allreduce.py and in cifar training.

Should be ±12 GB/s in the benchmark.

Does it hit llama or resnet?

LLaMA: no (the allreduce sizes are small); resnet: I don't know.
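
For context, a tiny sketch of the dispatch rule the quoted diff comment implies; should_use_ring and the 256k constant here are just stand-ins for the check in the PR:

```python
from math import prod

RING_THRESHOLD = 256 * 1024   # elements; the empirical cutoff discussed above

def should_use_ring(n_lbs: int, shape: tuple) -> bool:
  # Ring allreduce doesn't pay off with only 2 devices or with small buffers.
  return n_lbs > 2 and prod(shape) >= RING_THRESHOLD

assert should_use_ring(6, (1024, 1024))
assert not should_use_ring(2, (1024, 1024))
assert not should_use_ring(6, (128, 128))
```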

Comment (Contributor):

This branch is currently behind tinygrad/master. The line count difference bot is disabled.

@geohot merged commit c5bf9e4 into tinygrad:master Mar 20, 2024
16 checks passed
geohot added a commit that referenced this pull request Mar 20, 2024
geohot added a commit that referenced this pull request Mar 20, 2024
geohot added a commit that referenced this pull request Mar 21, 2024
@uuuvn deleted the ring_allreduce branch March 22, 2024 19:54
Labels: bounty locked (Bounty is locked to someone)
Projects: Status: Done
Linked issues: none
5 participants