[inductor] Add lowering and codegen for aten.sort #128458
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128458
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (3 Unrelated Failures) As of commit d7643bd with merge base c888ee3:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk.
UNSTABLE - The following job failed but was likely due to flakiness present on trunk and has been marked as unstable.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: dccc6d29e991f3c340847808fde00a970f8be5c8 Pull Request resolved: #128458
ghstack-source-id: e73a00fed67aa344036a8152693a36ef71b6dda4 Pull Request resolved: #128458
ghstack-source-id: 859705512a7341dedf55477606723453d5f21ec0 Pull Request resolved: #128458
sort_numel = sizevars.simplify(sympy_product(sort_ranges))

# Heuristic, smallest rblock where triton usually outperforms aten.sort
max_rblock = 256
This only uses the lowering for rnumel <= 256, because above that it seems to be significantly slower than eager. It also isn't bandwidth bound, so I'm not convinced fusion would save us here.
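A minimal sketch of the size cutoff being described (the helper names here are illustrative, not the actual inductor code): lower aten.sort to the persistent triton kernel only when the padded row length fits in the largest profitable rblock, and otherwise fall back to eager.

```python
# Hypothetical sketch of the heuristic; MAX_RBLOCK and use_triton_sort
# are illustrative names, not inductor's real API.
MAX_RBLOCK = 256  # smallest rblock where triton usually outperforms aten.sort

def next_power_of_2(n):
    return 1 if n <= 1 else 1 << (n - 1).bit_length()

def use_triton_sort(sort_numel):
    # the persistent kernel pads rnumel up to a power of two
    return next_power_of_2(sort_numel) <= MAX_RBLOCK
```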
can you leave this as a comment in the code?
Even if this kernel isn't bandwidth bound, the subsequent fusion might be.
@triton.jit
def sort_with_index(
These functions are adapted from the triton library's sort, but with stable sorting added and an rmask to support sorting on non-power-of-two sizes.
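A pure-Python sketch of that idea (an illustration, not the actual triton kernel): a bitonic network is unstable, but sorting (value, index) pairs lexicographically makes equal values keep their original order, and padding masked lanes with +inf plays the role of the rmask for non-power-of-two sizes.

```python
import math

def stable_bitonic_sort(xs):
    """Sort xs ascending, returning (values, indices), stably."""
    n = len(xs)
    m = 1
    while m < n:
        m *= 2
    # Masked (out-of-range) lanes are padded with (+inf, index) so they
    # compare greater than every real element and end up past position n.
    pairs = [(float(x), i) for i, x in enumerate(xs)]
    pairs += [(math.inf, i) for i in range(n, m)]

    def compare_and_swap(i, l):
        # Lexicographic comparison: value first, then original index.
        # The index tie-break is what makes the unstable bitonic network
        # behave like a stable sort.
        if pairs[i] > pairs[l]:
            pairs[i], pairs[l] = pairs[l], pairs[i]

    k = 2
    while k <= m:            # merge bitonic runs of length k
        j = k // 2
        while j >= 1:        # compare elements j apart
            for i in range(m):
                l = i ^ j
                if l > i:
                    if i & k == 0:
                        compare_and_swap(i, l)   # ascending half
                    else:
                        compare_and_swap(l, i)   # descending half
            j //= 2
        k *= 2
    return [v for v, _ in pairs[:n]], [idx for _, idx in pairs[:n]]
```

Equal values (the two 3s and the two 1s in [3, 1, 3, 2, 1]) come out in input order, which is the stability guarantee the index tie-break buys.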
# ops.sort only works with persistent reduction, and is not bandwidth bound anyway
# so taking the hit of non-coalesced loads is okay
has_sort = any(_node_has_sort(node) for node in node_schedule)
This is a bit of a work-around. For outer reductions we don't normally use RBLOCK >= 64, but for sort we really need the larger RBLOCK, since we can't loop over the reduction: there is no non-persistent version of the sort. So I just add an exception if I find ops.sort in the kernel.
Mind adding some benchmarks to the OP?
The implementation LGTM
return bool(sort_nodes)

# ops.sort only works with persistent reduction, and is not bandwidth bound anyway
# so taking the hit of non-coalesced loads is okay
In what ways are coalesced loads related to whether a kernel is persistent or not?
For outer reductions we can use e.g. XBLOCK=32, RBLOCK=32, which requires a non-persistent reduction for rnumel > 32 but allows the loads to be coalesced in the x dimension.
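A toy model of the trade-off described (illustrative only; these helper names are not inductor's):

```python
# Row-major (rnumel, xnumel) tensor: element (r, x) lives at r * xnumel + x.
# An x-tile of XBLOCK adjacent columns makes the XBLOCK lanes of one load
# touch consecutive addresses (coalesced). A persistent kernel that assigns
# a single x per program instead strides by xnumel between its r-lanes.
def x_tile_addresses(r, x0, xblock, xnumel):
    return [r * xnumel + x0 + lane for lane in range(xblock)]

def r_lane_addresses(x, r0, rblock, xnumel):
    return [(r0 + lane) * xnumel + x for lane in range(rblock)]
```

x_tile_addresses(5, 0, 4, 128) gives four consecutive addresses, while r_lane_addresses strides by xnumel between lanes, which is the non-coalesced pattern the comment accepts for sort.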
sort_numel = sizevars.simplify(sympy_product(sort_ranges))

# Heuristic, smallest rblock where triton usually outperforms aten.sort
max_rblock = 256
can you leave this as a comment in the code?
torch/_inductor/ops_handler.py (Outdated)
@@ -744,6 +756,10 @@ def frexp(x) -> Tuple[None, None]:
def scan(dtypes, combine_fn, values) -> Tuple[None, ...]:
    return tuple(None for i in range(len(values)))

@staticmethod
def sort(dtypes, values, stable, descending) -> Tuple[None, ...]:
    return tuple(None for i in range(len(values)))
Nit: it follows the pattern above, but we can simply do (None,) * len(values).
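The two spellings build the same tuple; the multiplication form just avoids the generator and the unused loop variable:

```python
# Both expressions produce a tuple of len(values) Nones.
values = ["q", "k", "v"]
via_generator = tuple(None for i in range(len(values)))
via_multiply = (None,) * len(values)
assert via_generator == via_multiply == (None, None, None)
```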
ileft = tl.broadcast_to(tl.sum(iy * left_mask, 1)[:, None, :], shape)
iright = tl.broadcast_to(tl.sum(iy * right_mask, 1)[:, None, :], shape)
We use this trick all over the place because triton's reduce does not work for non-commutative operators. Otherwise, we could do a reduce with lambda a, b: b. If you think it'd be beneficial (you say this kernel is not bandwidth bound?), I could send a PR to triton fixing this.
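A pure-Python model of the masked-sum trick being discussed: a reduce with lambda a, b: b ("take last") is non-commutative, so to select one lane you instead multiply by a one-hot mask and sum, since addition is commutative and only the masked lane contributes.

```python
# Select the element at `lane` using only a commutative reduction (sum),
# mimicking what the masked tl.sum in the snippet above achieves.
def select_lane(values, lane):
    one_hot = [1 if i == lane else 0 for i in range(len(values))]
    return sum(v * m for v, m in zip(values, one_hot))
```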
Feel free to give it a shot
ghstack-source-id: 559ed4e005ee962dbad2b0824f27d4935afe2b19 Pull Request resolved: #128458
descending=descending,
)
if values is None:
    return sort_fallback(x, stable=stable, dim=dim, descending=descending)
Should also fall back if using halide codegen.
sort_numel = sizevars.simplify(sympy_product(sort_ranges))

# Heuristic, smallest rblock where triton usually outperforms aten.sort
max_rblock = 256
Even if this kernel isn't bandwidth bound, the subsequent fusion might be.
# Heuristic, smallest rblock where triton usually outperforms aten.sort
# It also isn't bandwidth bound so fusion is unlikely to help.
max_rblock = 256
Previously you had suggested 1024 as the max. Would you mind posting the numbers for max_rblock > 256, maybe with and without cheap pointwise fusion?
Sure, here are some speedup numbers for torch.sort(a * a, dim=-1)[0] + 2 at different shapes, so we have fusions before and after the sort, as well as being able to eliminate the index buffer to give the triton kernel its best shot:
1024, 255: 2.1x
1024, 256: 6.0x
1024, 257: 0.87x
1024, 511: 1.0x
1024, 512: 2.2x
1024, 513: 0.35x
1024, 1023: 0.36x
1024, 1024: 0.93x
You can see that it's marginally but noticeably worse between 257-511, and only the exact value of 512 gains any speedup (where the mask is removed). Compared to 256
self.common(fn, (inp, False))
self.common(fn, (inp, True))

def test_sort_stable(self):
Maybe add some tests with rblock constrained <= 256 but still dynamic?
@triton.jit
def sort_with_index(
Do we optimize out the index global write when it's not needed? And if so, it still might be possible that triton isn't smart enough to remove all the index-related intermediaries.
The write would be removed by inductor, and triton can probably DCE the final steps of the index computation, but I do expect we could get a bigger win by having custom codegen for sort without index. Though that wouldn't be able to support stable sorting, since I'm using the indices to get a stable sort from the bitonic sort algorithm, which is naturally unstable.
Actually benchmarking it I see without indices is almost 2x faster so I guess triton is able to remove the index reductions entirely. That's pretty cool actually.
tl.static_assert(
    _dim == len(x.shape) - 1, "only minor dimension is currently supported"
)
# iteratively run bitonic merge-sort steps
Could we pad the masked values with inf/-inf here, depending on descending, instead of doing all the mask work in compare_and_swap_with_index? Maybe that would give better perf for masked inputs?
That has the issue that inf might appear in the input, and so we might sort the padding into the output. That said, it would work if either:
- we don't care about the indices, or
- we're doing a stable sort, so the in-sequence infs would get sorted earlier than the padding.
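A toy illustration of that escape hatch (illustrative only; the helper name is made up): pad masked lanes with +inf, sort stably, then truncate. Because a stable sort keeps equal keys in insertion order, a genuine inf in the input stays ahead of the padding appended after it, so truncating to the original length never returns a padding lane. An unstable sort could interleave the padding among the real infs, which matters once indices are also returned.

```python
import math

def sort_with_inf_padding(xs, padded_len):
    # Append +inf padding up to padded_len, then sort and truncate.
    padded = list(xs) + [math.inf] * (padded_len - len(xs))
    # Python's sorted() is stable: real infs (inserted first) keep their
    # place ahead of padding infs, so the slice below is safe.
    return sorted(padded)[: len(xs)]
```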
Yeah, not blocking for this to land, but maybe file an issue after you land? From the original issue, 5/6 of the sorts either did not use indices or used stable sort: https://gist.github.com/eellison/36f0c3e6025360315dd67932461ff11b
For the index case, we could use a negative number for the padded index and use that as part of the tie-breaker.
Opened #129507
[ghstack-poisoned]
ghstack-source-id: 1809983d39ae9039fc3f83b4b6738728c47861c4 Pull Request resolved: #128458
@pytorchbot merge
Merge started. Your change will be merged once all checks pass (ETA 0-4 Hours). Learn more about merging in the wiki. Questions? Feedback? Please reach out to the PyTorch DevX Team.
ghstack-source-id: 1809983d39ae9039fc3f83b4b6738728c47861c4 Pull Request resolved: pytorch#128458
Stack from ghstack (oldest at bottom):
Closes #125633
Benchmarks:
cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang