Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid moving REG_SP (make it a base pointer) #546

Open
jhawthorn opened this issue Aug 11, 2021 · 5 comments
Open

Avoid moving REG_SP (make it a base pointer) #546

jhawthorn opened this issue Aug 11, 2021 · 5 comments

Comments

@jhawthorn
Copy link

jhawthorn commented Aug 11, 2021

This is an idea to improve how we handle stack addresses to make codegen simpler to write and possibly improve performance.

Currently, REG_SP always has the same value of reg_cfp->sp, but to avoid unnecessary register/memory reads/writes we store an sp_offset and build our memory operands (ctx_sp_opnd) using that offset. When we call into C code, call another method, or return to the interpreter, we call jit_save_sp, which updates both REG_SP and REG_CFP->sp.

This causes some awkwardness in our codegen, because any memory operands we build using ctx_sp_opnd become invalid when REG_SP is updated (since they are relative to REG_SP). This requires us to be careful about when these operands are created vs used.

For example in gen_opt_aref

            // About to change REG_SP which these operands depend on. Yikes.
            mov(cb, R8, recv_opnd);
            mov(cb, R9, idx_opnd);

            // Write sp to cfp->sp since rb_hash_aref might need to call #hash on the key
            jit_save_sp(jit, ctx);

            yjit_save_regs(cb);

            mov(cb, C_ARG_REGS[0], R8);
            mov(cb, C_ARG_REGS[1], R9);

I think it would be possible to avoid this issue by treating REG_SP as a base pointer of the stack rather than a moving stack pointer (should we also rename it?). Basically this means that REG_SP would not move throughout the iseq, it would keep the same value it had at the YJIT entry point. Updating REG_CFP->sp would be done in the same way in the same places, however we'd now use a temporary register instead of updating REG_SP (same as we do for jit_save_pc).

In addition to making codegen simpler to write, I believe this will reduce the number of block versions.

Currently there is this comment at the end of gen_send_cfunc:

    // Note: gen_send_iseq() jumps to the next instruction with ctx->sp_offset == 0
    // after the call, while this does not. This difference prevents
    // the two call types from sharing the same successor.

Calling an iseq ends up with an sp_offset of 0, but calling a cfunc ends up with an sp_offset of 1 (we must push the return value from the cfunc ourselves). If we instead use the base SP rather than a moving SP, the sp_offsets will be the same leaving both basic blocks, therefore have compatible ctx, and can jump into the same target.

Another potential advantage (less sure of this) is that by not moving the register we will avoid a data dependency in following stack operations, which may improve CPU pipelining by avoiding stalls.

@maximecb
Copy link

What you wrote makes sense. The main question I have is: can we count on the __bp__ field in rb_control_frame_struct always being present? Does MJIT currently make use of this?

Another potential detail is: I remember Alan talking about using the call-threaded handlers so we can fall back to the interpreter for a single instruction, and in that case, we might end up in a situation where we don't know what the SP is.

I think that, if we only use a BP, we don't care what the value of the SP is as much in YJIT. However, we probably would still need to always make sure that the SP is up to date whenever we exit JITted code. That's a bit of a bummer, because it means we probably would have to keep the sp_offset and we may have to write both SP and BP values to stack frames at any call or side-exit boundaries.

@XrXr any comments?

@XrXr
Copy link

XrXr commented Aug 12, 2021

Yeah I think this make sense. I feel confident relying on __bp__ since the interpreter checks it in leave.
https://github.com/ruby/ruby/blob/5534698b84c1ef1567ebb1e2d79fbe1a2a573a77/insns.def#L902

We shouldn't need to change __bp__ cause it's supposed to stay constant for the frame. For one-shot fallback to the interpreter, I needed to know the SP delta for the right sp_offset anyways, so that won't be problem.

A note for implementation, since sp_offset is only 16 bits, we need to take care to test for large stacks.

@jhawthorn
Copy link
Author

jhawthorn commented Aug 12, 2021

can we count on the __bp__ field in rb_control_frame_struct always being present? Does MJIT currently make use of this

Currently it's always present, but I don't think we should rely on it. https://github.com/Shopify/yjit/blob/main/vm_insnhelper.c#L2172-L2200 suggests it's likely to be removed, but I believe we'll be able to calculate BP as SP - stack_size.

I think that, if we only use a BP, we don't care what the value of the SP is as much in YJIT. However, we probably would still need to always make sure that the SP is up to date whenever we exit JITted code. That's a bit of a bummer, because it means we probably would have to keep the sp_offset and we may have to write both SP and BP values to stack frames at any call or side-exit boundaries.

We'll always have to write cfp->sp on side exits (as we already do today). I don't think we'll have to update __bp__ as it shouldn't move, but we may need to recalculate (or re-fetch it from cfp) if we re-enter the jitted code.

@maximecb
Copy link

So we'd keep sp_offset, compute the BP when entering JITted code, update the SP when leaving or doing a fallback, update BP on method calls.

I'm not sure what the impact on code size would be. I think it might remove some instructions in a few places, but also add a few more in other places since we'd be updating the SP still 🤔 May remove some data dependencies as you said.

@XrXr
Copy link

XrXr commented Aug 12, 2021

BP == SP on entry to a frame, so no calculation needed there. The interpreter is checking that when leaving, SP is the same as it were on entry to the frame. But we do need to make calculations or load from cfp when we leave from output code to output code. Actually, I think this extra calculation/load is why we moved away from keeping the stack size to keeping an offset originally.

The dependency thing is probably not going to make much of a perf difference since the limiting factor on stalls is probably memory access around the relatively quick sp calculation. The main win I see for making this change is making it easier to write codegen / develop the new backend.

@maximecb maximecb transferred this issue from Shopify/yjit Nov 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants