
Ring allreduce in multitensor #3000

Merged — 26 commits merged into tinygrad:master from ring_allreduce on Mar 20, 2024

Conversation

@uuuvn (Contributor) commented Jan 4, 2024

  • Basic ring_allreduce impl
  • Fix last failing test
  • Fuzz test it against naive implementation
  • Make code look good
  • Benchmark to make sure that it's actually faster
  • Hopefully merge
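
For context, here is a minimal NumPy sketch of the ring allreduce idea (a reduce-scatter pass followed by an allgather pass). It illustrates the general algorithm only, not the multitensor code in this PR:

```python
# Toy ring allreduce over plain numpy arrays: each "device" splits its buffer
# into n chunks, the chunks are summed around the ring (reduce-scatter), then
# the fully reduced chunks are circulated to everyone (allgather).
import numpy as np

def ring_allreduce(buffers):
  n = len(buffers)
  chunks = [np.array_split(b.astype(np.float32), n) for b in buffers]
  # Reduce-scatter: after n-1 steps, device i owns the full sum of chunk (i+1) % n.
  for step in range(n - 1):
    for i in range(n):
      c = (i - step) % n
      chunks[(i + 1) % n][c] += chunks[i][c]
  # Allgather: pass the reduced chunks around so every device ends up with all of them.
  for step in range(n - 1):
    for i in range(n):
      c = (i + 1 - step) % n
      chunks[(i + 1) % n][c] = chunks[i][c]
  return [np.concatenate(c) for c in chunks]

bufs = [np.arange(8, dtype=np.float32) + i for i in range(4)]
assert all(np.allclose(out, sum(bufs)) for out in ring_allreduce(bufs))
```

Each device moves roughly 2·(n−1)/n of its buffer instead of broadcasting the whole thing to everyone, which is why this should win for large tensors.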

@uuuvn marked this pull request as draft January 4, 2024 01:10
@uuuvn (Contributor, Author) commented Jan 4, 2024

The last failing test is def test_simple_reduce_1(self): return self._test_simple_reduce_axis(1). I suspect it fails because something is wrong with the way I chunk LazyBuffers.

@uuuvn (Contributor, Author) commented Jan 4, 2024

Very strange: when I call the function with the same LazyBuffers, but created from numpy to reproduce and debug, everything is fine.
The problem also disappears when I add lbs = [lb.contiguous() for lb in lbs] at the start of the function to force-realize the LazyBuffers.
Some bug in lazy/jit/whatever? Maybe I'm doing something that somehow breaks the ShapeTracker?

@g1y5x3 (Contributor) commented Jan 4, 2024

So I was referring to chunk in tensor.py, but now the question is where the appropriate place for chunk is. Also, you can refer to extra/dist/collectives.py for making it multiprocess. Once I have the multi-GPU training benchmarks, it will be easier to test the implementation.
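
For reference, a quick sketch of the kind of per-device chunking being discussed, using tinygrad's torch-style Tensor.chunk (signature from memory; the PR itself operates on LazyBuffers rather than Tensors):

```python
from tinygrad.tensor import Tensor

# Split a flat tensor into one shard per device, the way a ring allreduce would.
t = Tensor.arange(1024)
n_devices = 4
shards = t.chunk(n_devices)      # list of 4 tensors, 256 elements each
assert len(shards) == n_devices
```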

@uuuvn (Contributor, Author) commented Jan 4, 2024

Very strange: when I call the function with the same LazyBuffers, but created from numpy to reproduce and debug, everything is fine. The problem also disappears when I add lbs = [lb.contiguous() for lb in lbs] at the start of the function to force-realize the LazyBuffers. Some bug in lazy/jit/whatever? Maybe I'm doing something that somehow breaks the ShapeTracker?

The problem also disappears if I use the clang or llvm backend instead of metal...

@uuuvn marked this pull request as ready for review January 4, 2024 06:40
@geohot added the "bounty locked" label (Bounty is locked to someone) Jan 4, 2024
@uuuvn marked this pull request as draft January 5, 2024 02:59
@uuuvn force-pushed the ring_allreduce branch 4 times, most recently from cd8e4f6 to d47ec27, January 9, 2024 01:53
@uuuvn marked this pull request as ready for review January 9, 2024 09:57
@uuuvn (Contributor, Author) commented Jan 9, 2024

So... the only thing left is to make sure that it is actually faster, which, at least on my M1 machine, isn't the case.

Can somebody run python test/test_multitensor_speed.py on a 4+ GPU machine?

@geohot (Collaborator) commented Jan 9, 2024

How is the benchmarking coming?

@uuuvn (Contributor, Author) commented Jan 10, 2024

How is the benchmarking coming?

Can somebody run python test/test_multitensor_speed.py on a 4+ GPU machine?

Or is the benchmark something other than just a Python script that outputs results?

@geohot (Collaborator) commented Jan 10, 2024

Ahh, running on tiny9

tiny@tiny9:~/tinygrad$ python test/test_multitensor_speed.py
Iter (r) 0/10: 8.868210792541504
Iter (n) 0/10: 1.705146074295044
Iter (r) 1/10: 0.35244202613830566
Iter (n) 1/10: 0.49947524070739746
Iter (r) 2/10: 0.3451864719390869
Iter (n) 2/10: 0.4998817443847656
Iter (r) 3/10: 0.3451802730560303
Iter (n) 3/10: 0.49986791610717773
Iter (r) 4/10: 0.3446967601776123
Iter (n) 4/10: 0.5003104209899902
Iter (r) 5/10: 0.34472179412841797
Iter (n) 5/10: 0.5005695819854736
Iter (r) 6/10: 0.3451967239379883
Iter (n) 6/10: 0.4998791217803955
Iter (r) 7/10: 0.34467482566833496
Iter (n) 7/10: 0.49942994117736816
Iter (r) 8/10: 0.3446054458618164
Iter (n) 8/10: 0.4994697570800781
Iter (r) 9/10: 0.3446621894836426
Iter (n) 9/10: 0.4994165897369385
Average: 0.6203446388244629 naive vs 1.1979577302932738 ring

Does look faster. What are the units for this? And why so many floating-point digits!

@geohot (Collaborator) commented Jan 10, 2024

Hmm, ran a second time, slower

tiny@tiny9:~/tinygrad$ python test/test_multitensor_speed.py
Iter (r) 0/10: 0.6794774532318115
Iter (n) 0/10: 0.5218009948730469
Iter (r) 1/10: 0.37977075576782227
Iter (n) 1/10: 0.536362886428833
Iter (r) 2/10: 0.3786904811859131
Iter (n) 2/10: 1.2099571228027344
Iter (r) 3/10: 0.3500063419342041
Iter (n) 3/10: 0.49944448471069336
Iter (r) 4/10: 0.3451883792877197
Iter (n) 4/10: 0.49990272521972656
Iter (r) 5/10: 0.3456103801727295
Iter (n) 5/10: 0.5003464221954346
Iter (r) 6/10: 0.34472084045410156
Iter (n) 6/10: 0.49947381019592285
Iter (r) 7/10: 0.3455815315246582
Iter (n) 7/10: 0.4994392395019531
Iter (r) 8/10: 0.3451573848724365
Iter (n) 8/10: 0.49997830390930176
Iter (r) 9/10: 0.3456544876098633
Iter (n) 9/10: 0.49947142601013184
Average: 0.5766177415847777 naive vs 0.38598580360412593 ring

@uuuvn (Contributor, Author) commented Jan 10, 2024

Average: 0.6203446388244629 naive vs 1.1979577302932738 ring

It's actually slower...
0.6 seconds for naive and 1.19 for ring.

@uuuvn (Contributor, Author) commented Jan 10, 2024

The first one is the old impl, the second one is the ring one...

@uuuvn (Contributor, Author) commented Jan 10, 2024

So the first attempt was slower and the second was faster. Can you run it a couple more times?

@uuuvn (Contributor, Author) commented Jan 10, 2024

tiny@tiny9:~/tinygrad$ python test/test_multitensor_speed.py
Iter (r) 0/10: 8.868210792541504
Iter (n) 0/10: 1.705146074295044
Iter (r) 1/10: 0.35244202613830566
Iter (n) 1/10: 0.49947524070739746
Iter (r) 2/10: 0.3451864719390869
Iter (n) 2/10: 0.4998817443847656

It looks like only the first iteration was slow; after that, ring was faster.

@uuuvn (Contributor, Author) commented Jan 10, 2024

Iter (r) 0/10: 8.868210792541504
Iter (n) 0/10: 1.705146074295044

That's why on the first run it was slower on average, despite being faster after that:

Iter (r) 1/10: 0.37977075576782227
Iter (n) 1/10: 0.536362886428833
Iter (r) 2/10: 0.3786904811859131
Iter (n) 2/10: 1.2099571228027344
Iter (r) 3/10: 0.3500063419342041
Iter (n) 3/10: 0.49944448471069336
Iter (r) 4/10: 0.3451883792877197
Iter (n) 4/10: 0.49990272521972656
Iter (r) 5/10: 0.3456103801727295
Iter (n) 5/10: 0.5003464221954346
Iter (r) 6/10: 0.34472084045410156
Iter (n) 6/10: 0.49947381019592285
Iter (r) 7/10: 0.3455815315246582
Iter (n) 7/10: 0.4994392395019531
Iter (r) 8/10: 0.3451573848724365
Iter (n) 8/10: 0.49997830390930176
Iter (r) 9/10: 0.3456544876098633
Iter (n) 9/10: 0.49947142601013184

I'll just fix the test to discard the first run
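
A minimal sketch of that fix, discarding the warm-up iteration before averaging (illustrative helper, not the actual test code):

```python
import time

def bench(fn, iters=10):
  times = []
  for _ in range(iters):
    st = time.perf_counter()
    fn()
    times.append(time.perf_counter() - st)
  # Drop iteration 0: it includes kernel compilation / JIT warm-up and skews the average.
  return sum(times[1:]) / (len(times) - 1)
```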

@uuuvn (Contributor, Author) commented Jan 10, 2024

@geohot I made a fix to discard results from the first iteration, which is much slower for some reason.

Can you test it again?

@geohot (Collaborator) commented Jan 10, 2024

Your script isn't printing GB/s. What GB/s do you expect?

You can see the tinybox bandwidths between GPUs in "AMD Benchmark" CI

@uuuvn (Contributor, Author) commented Jan 10, 2024

Your script isn't printing GB/s. What GB/s do you expect?

You can see the tinybox bandwidths between GPUs in "AMD Benchmark" CI

I've rewritten the benchmark to output Gb/s and GFLOP/s.
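
Roughly, the numbers can be derived like this for an allreduce of N float32 elements across D devices; the 2·(D−1)/D traffic factor is the standard ring allreduce accounting, and the exact bookkeeping in the script may differ:

```python
def allreduce_stats(n_elems: int, n_devices: int, secs: float):
  # Per-device traffic for ring allreduce: 2*(D-1)/D of the buffer
  # (reduce-scatter + allgather), float32 = 4 bytes per element.
  bytes_per_device = 2 * (n_devices - 1) / n_devices * n_elems * 4
  # One common FLOP count for a sum-allreduce: (D-1) adds per reduced element.
  flops = (n_devices - 1) * n_elems
  return flops / secs / 1e9, bytes_per_device * 8 / secs / 1e9  # GFLOP/s, Gb/s

# Placeholder values, just to show the calculation.
gflops, gbits_per_s = allreduce_stats(n_elems=64 * 1024 * 1024, n_devices=6, secs=0.45)
```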

@uuuvn (Contributor, Author) commented Jan 10, 2024

Also renamed the benchmark script to test/bench_multitensor_speed.py; otherwise it gets run by CI.

@geohot (Collaborator) commented Jan 15, 2024

tiny@tiny9:~/tinygrad$ python3 test/bench_multitensor_speed.py
naive_allreduce iter 0/10: 0.71 sec 9.07 GFLOP/s 108.85 Gb/s
naive_allreduce iter 1/10: 0.65 sec 9.88 GFLOP/s 118.53 Gb/s
naive_allreduce iter 2/10: 0.65 sec 9.89 GFLOP/s 118.71 Gb/s
naive_allreduce iter 3/10: 0.65 sec 9.91 GFLOP/s 118.88 Gb/s
naive_allreduce iter 4/10: 0.65 sec 9.91 GFLOP/s 118.89 Gb/s
naive_allreduce iter 5/10: 0.65 sec 9.89 GFLOP/s 118.69 Gb/s
naive_allreduce iter 6/10: 0.65 sec 9.91 GFLOP/s 118.89 Gb/s
naive_allreduce iter 7/10: 0.65 sec 9.89 GFLOP/s 118.71 Gb/s
naive_allreduce iter 8/10: 0.65 sec 9.89 GFLOP/s 118.71 Gb/s
naive_allreduce iter 9/10: 0.65 sec 9.89 GFLOP/s 118.68 Gb/s
ring_allreduce iter 0/10: 0.45 sec 116.63 GFLOP/s 242.78 Gb/s
ring_allreduce iter 1/10: 0.45 sec 117.74 GFLOP/s 245.10 Gb/s
ring_allreduce iter 2/10: 0.44 sec 118.51 GFLOP/s 246.69 Gb/s
ring_allreduce iter 3/10: 0.44 sec 118.50 GFLOP/s 246.67 Gb/s
ring_allreduce iter 4/10: 0.44 sec 118.38 GFLOP/s 246.42 Gb/s
ring_allreduce iter 5/10: 0.44 sec 118.44 GFLOP/s 246.55 Gb/s
ring_allreduce iter 6/10: 0.44 sec 118.39 GFLOP/s 246.45 Gb/s
ring_allreduce iter 7/10: 0.44 sec 118.46 GFLOP/s 246.60 Gb/s
ring_allreduce iter 8/10: 0.44 sec 118.47 GFLOP/s 246.61 Gb/s
ring_allreduce iter 9/10: 0.44 sec 118.33 GFLOP/s 246.32 Gb/s
Naive:
  0.66 seconds/iter
  9.81 GFLOP/s
  117.75 Gb/s
Ring:
  0.45 seconds/iter
  118.18 GFLOP/s
  246.02 Gb/s

Gb/s or GB/s? What should it be in theory?
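
As a rough answer to the theory question: with D devices and S bytes per device, a ring allreduce moves about 2·(D−1)/D·S bytes over each link, so the achievable rate is bounded by the per-link bandwidth (mind the bits-versus-bytes factor of 8 when comparing Gb/s to GB/s). A back-of-the-envelope check, with a placeholder link bandwidth rather than the tinybox's real number:

```python
# Lower bound on ring allreduce time if the ring link is the bottleneck.
# link_GBps is a placeholder; the real GPU-to-GPU numbers are in the "AMD Benchmark" CI.
def ring_allreduce_time_bound(buffer_bytes: int, n_devices: int, link_GBps: float) -> float:
  per_link_bytes = 2 * (n_devices - 1) / n_devices * buffer_bytes
  return per_link_bytes / (link_GBps * 1e9)   # seconds, ignoring latency and the adds

print(ring_allreduce_time_bound(buffer_bytes=1 << 30, n_devices=6, link_GBps=25.0))
```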

@chenyuxyz added this to the "llama2 70B FP16 10 tok/s" milestone Mar 10, 2024
@chenyuxyz (Collaborator) commented
@nimlgen can you review this one and see how much it overlaps with #3626, and which approach (or both) we will keep long term?

@geohot (Collaborator) commented Mar 10, 2024

Can confirm it improves the speed of cifar a little:

tiny@tiny12:~/tinygrad$ HSA=1 HALF=1 STEPS=350 BS=1536 GPUS=6 TARGET_EVAL_ACC_PCT=93 python3 examples/hlb_cifar10.py
shuffling training dataset in 958.98 ms (epoch=0)
  0 18045.31 ms run, 18039.51 ms python,    5.80 ms HSA * 6, 1197.00 loss, 0.000043 LR, 0.56 GB used,    111.74 GFLOPS,   2016.42 GOPS
  1 1139.61 ms run, 1136.97 ms python,    2.65 ms HSA * 6, 1195.00 loss, 0.000085 LR, 6.63 GB used,   1767.76 GFLOPS,   2014.56 GOPS
  2   53.81 ms run,   13.86 ms python,   39.96 ms HSA * 6, 1178.00 loss, 0.000128 LR, 6.64 GB used,  37436.16 GFLOPS,   2014.56 GOPS
  3   45.62 ms run,   12.35 ms python,   33.27 ms HSA * 6, 1163.00 loss, 0.000171 LR, 6.64 GB used,  44158.40 GFLOPS,   2014.56 GOPS
  4   46.78 ms run,   13.41 ms python,   33.37 ms HSA * 6, 1161.00 loss, 0.000214 LR, 6.64 GB used,  43068.90 GFLOPS,   2014.56 GOPS
  5   45.42 ms run,   12.30 ms python,   33.11 ms HSA * 6, 1156.00 loss, 0.000256 LR, 6.64 GB used,  44355.12 GFLOPS,   2014.56 GOPS
  6   45.32 ms run,   12.29 ms python,   33.03 ms HSA * 6, 1138.00 loss, 0.000299 LR, 6.64 GB used,  44451.15 GFLOPS,   2014.56 GOPS

Review thread on tinygrad/realize.py (outdated, resolved).
@nimlgen (Collaborator) commented Mar 14, 2024

Profiled llama; it seems that the delay between packets in SDMA queues is slightly higher than in AQL. I am going to measure the performance impact of moving some copies to AQL queues.

@uuuvn (Contributor, Author) commented Mar 16, 2024

[image attachment]

@uuuvn (Contributor, Author) commented Mar 16, 2024

39 ms down from 47 ms on hlb_cifar with ring

@uuuvn (Contributor, Author) commented Mar 16, 2024

LLaMA slowness is fixed. Ready for review, I think.

But #3560 is required for the benchmark on HSA not to crash, and there is an over-6500-line issue.

@chenyuxyz (Collaborator) commented
#3560 is merged. Is this ready? I see cifar is slightly faster; llama is the same.

python test/external/external_benchmark_multitensor_allreduce.py

...
ring_allreduce iter 9/10: 0.085602 sec 102.22 GFLOP/s 11.68 GB/s
ring_allreduce iter 10/10: 0.085636 sec 102.18 GFLOP/s 11.68 GB/s
Traceback (most recent call last):
  File "/home/chenyu/tinygrad/test/external/external_benchmark_multitensor_allreduce.py", line 63, in <module>
    main()
  File "/home/chenyu/tinygrad/test/external/external_benchmark_multitensor_allreduce.py", line 58, in main
    (naive_gflops, naive_gbs, naive_secs) = test(naive_allreduce, devs, N)
  File "/home/chenyu/tinygrad/test/external/external_benchmark_multitensor_allreduce.py", line 32, in test
    realize(_jitted(impl, ReduceOps.SUM, Tensor(MultiLazyBuffer(lbs, 0), device=devs)).lazydata.lbs)
  File "/home/chenyu/tinygrad/tinygrad/features/jit.py", line 115, in __call__
    for ji in self.jit_cache: ji.prg(cast(List[Buffer], ji.rawbufs), var_vals, wait=DEBUG>=2, jit=True)
  File "/home/chenyu/tinygrad/tinygrad/runtime/graph/hsa.py", line 130, in __call__
    elif i == 1: self.transfers[self.ji_to_transfer[j]][2] = input_rawbuffers[input_idx]._buf # src
IndexError: list index out of range

@uuuvn (Contributor, Author) commented Mar 18, 2024

#3560 is merged. Is this ready? I see cifar is slightly faster; llama is the same.

[benchmark command and traceback quoted from the previous comment]

It is ready, except for this HSA transfer thing (the naive allreduce in my benchmark triggers an HSA bug).

Looks like there is something wrong with #3560: it fixed this before, but it seems @nimlgen changed it and now there is something else wrong with JIT transfers...

Diff under review:

return [functools.reduce(lambda x,y: x.e(bop, y), [x.copy_to_device(lb.device) for x in lbs]) for lb in lbs]

n_lbs, dim = len(lbs), prod(lbs[0].shape)
# Ring allreduce doesn't provide a benefit with only 2 nodes or where number of elements is less than 256k (empirically)
Review comment (Collaborator):

Skipping 2 nodes makes sense. Can you show how you derived the 256k number?

Also, what is this expected to hit in the CI benchmark? I see an improvement in test/external/external_benchmark_multitensor_allreduce.py and in cifar training. Does it hit llama or resnet?

Reply — @uuuvn (Contributor, Author):

Can you show how you derived the 256k number?

By running it with different sizes and seeing what's faster.

Also, what is this expected to hit in the CI benchmark? I see an improvement in test/external/external_benchmark_multitensor_allreduce.py and in cifar training.

Should be ±12 GB/s in the benchmark.

Does it hit llama or resnet?

LLaMA: no (the allreduce sizes are small); resnet: I don't know.
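
For context, a tiny sketch of the dispatch rule the quoted diff comment implies; should_use_ring and the 256k constant here are just stand-ins for the check in the PR:

```python
from math import prod

RING_THRESHOLD = 256 * 1024   # elements; the empirical cutoff discussed above

def should_use_ring(n_lbs: int, shape: tuple) -> bool:
  # Ring allreduce doesn't pay off with only 2 devices or with small buffers.
  return n_lbs > 2 and prod(shape) >= RING_THRESHOLD

assert should_use_ring(6, (1024, 1024))
assert not should_use_ring(2, (1024, 1024))
assert not should_use_ring(6, (128, 128))
```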

Comment (Contributor):

This branch is currently behind tinygrad/master. The line count difference bot is disabled.

@geohot merged commit c5bf9e4 into tinygrad:master Mar 20, 2024
16 checks passed
geohot added a commit that referenced this pull request Mar 20, 2024
geohot added a commit that referenced this pull request Mar 20, 2024
geohot added a commit that referenced this pull request Mar 21, 2024
@uuuvn deleted the ring_allreduce branch March 22, 2024 19:54
Labels: bounty locked (Bounty is locked to someone)
Projects: Status: Done
Linked issues: none
5 participants