Add simple fused Triton kernel benchmark for jagged_mean operator #2355

Closed · wants to merge 4 commits

Conversation

jananisriram (Contributor)

Summary:
Add a Triton kernel benchmark implementing a simple fused `mean` for the `jagged_mean` operator. The Triton kernels perform a `mean` along the ragged dimension of a nested tensor of logical dimensions `(B, *, M)`, where `*` is the ragged dimension. They load blocks of the values tensor along its last dimension `M`, reduce each block of variable length along its first dimension `*`, and store each of the `B` reductions in an output tensor of shape `(B, M)`. The first kernel, `sum_then_buffer`, performs a `sum` on each block of input and then accumulates the result into a buffer. The second kernel, `buffer_then_sum`, is a faster implementation that accumulates blocks into a buffer and then performs a single `sum` over the accumulated buffer.
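
For illustration only, here is a minimal sketch of what the `sum_then_buffer` variant could look like. The pointer layout (a flattened values tensor plus a `(B + 1)`-length offsets tensor delimiting the ragged segments), the block sizes, and the kernel name are assumptions; the actual kernels ship with this PR in the TritonBench source.

```python
import triton
import triton.language as tl


@triton.jit
def jagged_mean_sum_then_buffer(  # hypothetical name; see the PR for the real kernels
    values_ptr,   # flattened (total_rows, M) values of the nested tensor
    offsets_ptr,  # (B + 1,) row offsets delimiting each ragged segment
    output_ptr,   # (B, M) output
    M,
    BLOCK_SIZE_RAGGED: tl.constexpr,
    BLOCK_SIZE_M: tl.constexpr,
):
    pid = tl.program_id(axis=0)
    pid_b = pid // tl.cdiv(M, BLOCK_SIZE_M)  # which ragged segment
    pid_m = pid % tl.cdiv(M, BLOCK_SIZE_M)   # which block of columns in M

    offsets_m = pid_m * BLOCK_SIZE_M + tl.arange(0, BLOCK_SIZE_M)
    mask_m = offsets_m < M

    ragged_start = tl.load(offsets_ptr + pid_b)
    ragged_end = tl.load(offsets_ptr + pid_b + 1)

    buffer = tl.zeros((BLOCK_SIZE_M,), dtype=tl.float32)

    # sum_then_buffer: reduce each block along the ragged dimension, then
    # accumulate the per-block sums into a 1-D buffer. buffer_then_sum would
    # instead add raw blocks into a 2-D buffer and reduce once after the loop.
    for block_start in range(ragged_start, ragged_end, BLOCK_SIZE_RAGGED):
        offsets_ragged = block_start + tl.arange(0, BLOCK_SIZE_RAGGED)
        mask_ragged = offsets_ragged < ragged_end
        idxs = offsets_ragged[:, None] * M + offsets_m[None, :]
        mask = mask_ragged[:, None] & mask_m[None, :]
        block = tl.load(values_ptr + idxs, mask=mask, other=0.0)
        buffer += tl.sum(block, axis=0)

    mean = buffer / (ragged_end - ragged_start).to(tl.float32)
    tl.store(output_ptr + pid_b * M + offsets_m, mean, mask=mask_m)
    # launched with grid = (B * triton.cdiv(M, BLOCK_SIZE_M),)
```

Per the summary above, the `buffer_then_sum` layout was measured to be the faster of the two.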

This diff is particularly useful in emulating the loop in Inductor-generated (torch.compile) kernels and serves as a benchmark proxy for Inductor kernels.

Use the command-line argument `sum_then_buffer`, which defaults to `0` (since `buffer_then_sum` is faster, as shown below), to select which Triton kernel to benchmark.

These Triton kernels are benchmarked against two PyTorch implementations, one of which uses `torch.mean`, and the other `torch.div`, `torch.sum`, and `shape`.

This diff follows the general framework found in the `jagged_sum` operator (D58549297, D59034792).

Reviewed By: davidberard98

Differential Revision: D59146627

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D59146627

Summary:
Add to TritonBench a `jagged_mean` reduction operator for nested tensors using the PyTorch `torch.mean` and `unbind` functions. This diff implements a basic benchmark for reducing along the ragged dimension of 3-dimensional jagged tensors. For a 3-dimensional tensor of shape `(B, *, M)`, where `*` is the ragged dimension, this benchmark uses PyTorch's `mean` operator to reduce `B` `(*, M)` 2-dimensional tensors to a `(B, M)` output tensor.
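
Not part of the PR text, but a minimal sketch of the reference implementation described above, assuming a jagged-layout `torch.nested` tensor as input (the function name is illustrative):

```python
import torch


def jagged_mean_unbind_torch_mean(nt: torch.Tensor) -> torch.Tensor:
    """Reduce a nested tensor of logical shape (B, *, M) to (B, M) by taking
    torch.mean over the ragged dimension of each unbound (*, M) component."""
    return torch.stack([t.mean(dim=0) for t in nt.unbind()])


# Example: two components with ragged lengths 3 and 5, M = 4 -> output of shape (2, 4)
nt = torch.nested.nested_tensor(
    [torch.randn(3, 4), torch.randn(5, 4)], layout=torch.jagged
)
out = jagged_mean_unbind_torch_mean(nt)
```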

Add plotting functionality to the `jagged_mean` operator in TritonBench, enabling the creation of line plots for any set of benchmarks varied along one of the following input parameters: `B`, `M`, `seqlen`, or `sparsity`. This diff sets the groundwork for visualizing the differences in `latency` among the different benchmarks in the `jagged_mean` operator.
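
TritonBench drives its own plotting; purely to illustrate the kind of line plot described above, a standalone sketch might look like the following (all names are illustrative):

```python
import matplotlib.pyplot as plt


def plot_latency(x_values, latencies_by_benchmark, x_param="seqlen"):
    """Line plot of latency for each benchmark as one input parameter varies."""
    for name, latencies in latencies_by_benchmark.items():
        plt.plot(x_values, latencies, marker="o", label=name)
    plt.xlabel(x_param)
    plt.ylabel("latency (ms)")
    plt.legend()
    plt.savefig(f"jagged_mean_latency_vs_{x_param}.png")
```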

Measure the performance of the basic PyTorch benchmark using the `latency` and `gbps` metrics, as well as the `latency` plot varied along one input parameter. Display the nested tensor parameters in the benchmark output.
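
As a rough illustration of the `gbps` metric (an assumption about how such a metric is commonly derived, not necessarily TritonBench's exact formula), achieved bandwidth can be estimated from the bytes read and written and the measured latency:

```python
import torch


def gbps(nt: torch.Tensor, output: torch.Tensor, latency_ms: float) -> float:
    """Estimated bandwidth in GB/s: read every value of the jagged tensor once
    and write the (B, M) output once."""
    bytes_moved = (
        nt.values().numel() * nt.values().element_size()
        + output.numel() * output.element_size()
    )
    return bytes_moved / (latency_ms * 1e-3) / 1e9
```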

This diff follows the general framework found in the `jagged_sum` operator (D58396957, D59034792).

Differential Revision: D59144906

Reviewed By: davidberard98

Summary:
Add to TritonBench a `jagged_mean` reduction operator benchmark for nested tensors using the PyTorch `torch.sum`, `torch.div`, and `unbind` functions together with the tensor `shape` attribute. This diff implements a basic benchmark for reducing along the ragged dimension of 3-dimensional jagged tensors. For a 3-dimensional tensor of shape `(B, *, M)`, where `*` is the ragged dimension, this benchmark `unbind`s the nested tensor into `B` 2-dimensional `(*, M)` tensors. For each `(*, M)` tensor, the benchmark divides the `sum` along the ragged dimension `0` by its `shape` along dimension `0`, which yields the `mean` for that `(*, M)` tensor.
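
Again not from the PR itself, a minimal sketch of this second reference implementation under the same jagged-layout assumption as before:

```python
import torch


def jagged_mean_unbind_sum_div(nt: torch.Tensor) -> torch.Tensor:
    """For each unbound (*, M) component, divide the sum along the ragged
    dimension 0 by its length (shape[0]), stacking into a (B, M) output."""
    return torch.stack(
        [torch.div(torch.sum(t, dim=0), t.shape[0]) for t in nt.unbind()]
    )
```

The `accuracy` metric mentioned below can then, for example, compare this output against the `torch.mean` baseline with `torch.allclose`.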

Extend plotting functionality for the `jagged_mean` operator to account for the new benchmark. Add an `accuracy` metric to verify that the results of all existing benchmarks match.

This diff follows the general framework found in the `jagged_sum` operator (D58396957, D59034792).

Differential Revision: D59146024
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D59146627

jananisriram added a commit to jananisriram/benchmark that referenced this pull request Jul 2, 2024
…torch#2355)

Summary:
Pull Request resolved: pytorch#2355

Add Triton kernel benchmark implementing a simple fused `mean` for the `jagged_mean` operator. The Triton kernels perform a `mean` along the ragged dimension of a nested tensor of logical dimensions `(B, *, M)`, where `*` is the ragged dimension. They load in blocks of the values tensor along its last dimension `M`, reduce each block of variable length along its first dimension `*`, and store each of `B` reductions in an output tensor of shape `(B, M)`. The first kernel, `sum_then_buffer`, performs a `sum` on each block of input, then accumulates into a buffer. The second kernel, `buffer_then_sum`, is a faster implementation which accumulates blocks into a buffer, then performs a cumulative `sum`.

This diff is particularly useful in emulating the loop in Inductor-generated (`torch.compile`) kernels and serves as a benchmark proxy for Inductor kernels.

Use the command-line argument `sum_then_buffer`, defaulted to `0` (as `buffer_then_sum` is faster, shown below), to decide which Triton kernel to benchmark.

These Triton kernels are benchmarked against two PyTorch implementations, one of which uses `torch.mean`, and the other `torch.div`, `torch.sum`, and `shape`.

This diff follows the general framework found in the jagged_sum operator (D58549297, D59034792).

Reviewed By: davidberard98

Differential Revision: D59146627
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D59146627

@facebook-github-bot (Contributor)

This pull request has been merged in 1e79c04.
