[Transform API] Simple tiling can be tedious to implement #666

Open
kaushikcfd opened this issue Aug 9, 2022 · 3 comments
@kaushikcfd (Collaborator) commented Aug 9, 2022

Consider the simple batched matvec example:

import loopy as lp

knl = lp.make_kernel(
    "{[e,i,j]: 0<=e<1000 and 0<=i,j<10}",
    """
    out[e, i] = sum(j, D[i, j] * u[e, j])
    """,
    [lp.GlobalArg("u,D,out", "float64", lp.auto)])

# Tile "j"+prefetch to reduce the cache reuse distance of "D"
knl = lp.split_iname(knl, "j", 5)
knl = lp.add_prefetch(knl, "D", sweep_inames=["i", "j_inner"])

print(lp.generate_code_v2(knl).device_code())

This results in a linearization error with:

* Duplicate j_outer within instructions (id:insn_j_outer_j_inner_update)
* Duplicate i within instructions (id:insn_j_outer_j_inner_init or id:insn)

Notice how the duplication options force us to realize the reduction, which makes this simple transformation quite tedious to implement. We should have some interface to make this transformation easier, since (at least in the context of einsums) it is a pretty common one.
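
For reference, a minimal sketch of the first step of the current workaround, assuming lp.realize_reduction is called on the kernel explicitly (the remaining steps are discussed in the comments below):

# Explicitly realize the reduction so that the accumulator instructions
# (insn_*_init / insn_*_update) exist and can be targeted by later
# transformations such as duplicate_inames.
knl = lp.realize_reduction(knl)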

@kaushikcfd (Collaborator, Author) commented Aug 9, 2022

Some options could be:

  1. A kernel could have a post_realize_reduction_transforms_callback attribute (a hypothetical sketch of this option follows below).
  2. Reduction nodes could be made taggable, with some implementation tags defined that would help in this case.
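
Purely as a hypothetical illustration of option 1 (neither the attribute nor the hook below exists in loopy today), the user-facing side might look roughly like this:

# HYPOTHETICAL: post_realize_reduction_transforms_callback is a proposed
# attribute, not existing loopy API.
def _fix_up_accumulators(knl):
    # run whatever per-kernel clean-up is needed once the reduction has
    # been turned into explicit accumulator instructions
    return lp.privatize_temporaries_with_inames(knl, "i")

knl = knl.copy(
    post_realize_reduction_transforms_callback=_fix_up_accumulators)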

@inducer (Owner) commented Aug 18, 2022

For posterity, the resulting kernel is this:

  for D_dim_0, D_dim_1, j_outer
↱       D_fetch[D_dim_0, D_dim_1] = D[D_dim_0, D_dim_1 + 5*j_outer]  {id=D_fetch_rule}
│ end D_dim_0, D_dim_1, j_outer
│ for i, e
└     out[e, i] = reduce(sum, [j_outer, j_inner], D_fetch[i, j_inner]*u[e, j_inner + j_outer*5])  {id=insn}
  end i, e

and after reduction realization, it becomes:

    for j_outer, D_dim_0, D_dim_1
↱         D_fetch[D_dim_0, D_dim_1] = D[D_dim_0, D_dim_1 + 5*j_outer]  {id=D_fetch_rule}
│   end j_outer, D_dim_0, D_dim_1
│   for e, i
│↱      acc_j_outer_j_inner = 0.0  {id=insn_j_outer_j_inner_init}
││      for j_inner, j_outer
├└↱         acc_j_outer_j_inner = acc_j_outer_j_inner + D_fetch[i, j_inner]*u[e, j_inner + j_outer*5]  {id=insn_j_outer_j_inner_update}
│ │     end j_inner, j_outer
└ └     out[e, i] = acc_j_outer_j_inner  {id=insn}
    end e, i

I guess my first question here is what you would like to happen. Duplicating j_outer will lead to invalid code, I think. FWIW, I get different duplication options:

* Duplicate j_outer within instructions (id:D_fetch_rule)
* Duplicate j_outer within instructions (id:insn_j_outer_j_inner_update)
* Duplicate e within instructions (id:insn_j_outer_j_inner_init or id:insn)

@kaushikcfd (Collaborator, Author)

I would first privatize the temporary acc_j_outer_j_inner in the iname i and then duplicate i in the instructions insn and insn_j_outer_j_inner_init. This way, the accumulator's state is preserved as we perform the computation of a tile. A rough sketch of that sequence is below.
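
A minimal sketch of that sequence, assuming the instruction ids shown in the dump above and that the reduction is realized explicitly first:

knl = lp.realize_reduction(knl)
# give each i its own accumulator so its value survives across the j_outer tiles
knl = lp.privatize_temporaries_with_inames(
        knl, "i", only_var_names=["acc_j_outer_j_inner"])
# duplicate i for the accumulator-init and write-back instructions only
knl = lp.duplicate_inames(
        knl, "i", within="id:insn or id:insn_j_outer_j_inner_init")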
