[Transform API] Simple tiling can be tedious to implement #666

Open
kaushikcfd opened this issue Aug 9, 2022 · 3 comments
@kaushikcfd (Collaborator) commented Aug 9, 2022

Consider the simple batched matvec example:

import loopy as lp

knl = lp.make_kernel(
    "{[e,i,j]: 0<=e<1000 and 0<=i,j<10}",
    """
    out[e, i] = sum(j, D[i, j] * u[e, j])
    """,
    [lp.GlobalArg("u,D,out", "float64", lp.auto)])

# Tile "j"+prefetch to reduce the cache reuse distance of "D"
knl = lp.split_iname(knl, "j", 5)
knl = lp.add_prefetch(knl, "D", sweep_inames=["i", "j_inner"])

print(lp.generate_code_v2(knl).device_code())

This results in a linearization error with:

* Duplicate j_outer within instructions (id:insn_j_outer_j_inner_update)
* Duplicate i within instructions (id:insn_j_outer_j_inner_init or id:insn)

Notice how the duplication options force us to realize the reduction, which makes this simple transformation quite tedious to implement. We should have some interface to make this transformation easier, since (at least in the context of einsums) it is a pretty common one.
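
For reference, a minimal sketch of the first step of the current workaround, assuming lp.realize_reduction is called on the kernel explicitly (the remaining steps are discussed in the comments below):

# Explicitly realize the reduction so that the accumulator instructions
# (insn_*_init / insn_*_update) exist and can be targeted by later
# transformations such as duplicate_inames.
knl = lp.realize_reduction(knl)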

@kaushikcfd (Collaborator, Author) commented Aug 9, 2022

Some options could be:

  1. A kernel could have a post_realize_reduction_transforms_callback attribute (a hypothetical sketch of this option follows below).
  2. Reduction nodes could be made taggable, with some implementation tags defined that would help in this case.
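
Purely as a hypothetical illustration of option 1 (neither the attribute nor the hook below exists in loopy today), the user-facing side might look roughly like this:

# HYPOTHETICAL: post_realize_reduction_transforms_callback is a proposed
# attribute, not existing loopy API.
def _fix_up_accumulators(knl):
    # run whatever per-kernel clean-up is needed once the reduction has
    # been turned into explicit accumulator instructions
    return lp.privatize_temporaries_with_inames(knl, "i")

knl = knl.copy(
    post_realize_reduction_transforms_callback=_fix_up_accumulators)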

@inducer (Owner) commented Aug 18, 2022

For posterity, the resulting kernel is this:

  for D_dim_0, D_dim_1, j_outer
↱       D_fetch[D_dim_0, D_dim_1] = D[D_dim_0, D_dim_1 + 5*j_outer]  {id=D_fetch_rule}
│ end D_dim_0, D_dim_1, j_outer
│ for i, e
└     out[e, i] = reduce(sum, [j_outer, j_inner], D_fetch[i, j_inner]*u[e, j_inner + j_outer*5])  {id=insn}
  end i, e

and after reduction realization, it becomes:

    for j_outer, D_dim_0, D_dim_1
↱         D_fetch[D_dim_0, D_dim_1] = D[D_dim_0, D_dim_1 + 5*j_outer]  {id=D_fetch_rule}
│   end j_outer, D_dim_0, D_dim_1
│   for e, i
│↱      acc_j_outer_j_inner = 0.0  {id=insn_j_outer_j_inner_init}
││      for j_inner, j_outer
├└↱         acc_j_outer_j_inner = acc_j_outer_j_inner + D_fetch[i, j_inner]*u[e, j_inner + j_outer*5]  {id=insn_j_outer_j_inner_update}
│ │     end j_inner, j_outer
└ └     out[e, i] = acc_j_outer_j_inner  {id=insn}
    end e, i

I guess my first question here is what you would like to happen. Duplicating j_outer will lead to invalid code, I think. FWIW, I get different duplication options:

* Duplicate j_outer within instructions (id:D_fetch_rule)
* Duplicate j_outer within instructions (id:insn_j_outer_j_inner_update)
* Duplicate e within instructions (id:insn_j_outer_j_inner_init or id:insn)

@kaushikcfd (Collaborator, Author)

I would first privatize the temporary acc_j_outer_j_inner in the iname i and then duplicate i in the instructions insn and insn_j_outer_j_inner_init. This way, the accumulator's state is preserved as we perform the computation of a tile. A rough sketch of that sequence is below.
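
A minimal sketch of that sequence, assuming the instruction ids shown in the dump above and that the reduction is realized explicitly first:

knl = lp.realize_reduction(knl)
# give each i its own accumulator so its value survives across the j_outer tiles
knl = lp.privatize_temporaries_with_inames(
        knl, "i", only_var_names=["acc_j_outer_j_inner"])
# duplicate i for the accumulator-init and write-back instructions only
knl = lp.duplicate_inames(
        knl, "i", within="id:insn or id:insn_j_outer_j_inner_init")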
