Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thoughts on making wasm2c thread-safe #1805

Closed
yhdengh opened this issue Jan 13, 2022 · 4 comments
Closed

Thoughts on making wasm2c thread-safe #1805

yhdengh opened this issue Jan 13, 2022 · 4 comments

Comments

@yhdengh
Copy link
Contributor

yhdengh commented Jan 13, 2022

Hello all,

I would like to start a conversation about the best way to make wasm2c thread-safe so that the same module can be instantiated multiple times with different memory contexts, and functions from different instances can be executed in parallel. Our current plan was to move global memory variables into a structure and then pass a pointer to that structure as a parameter to every function call (Similar changes are also proposed by #1721). Before we draft a PR, I would like to see what are people’s thoughts about this. More details are below.

Since the changes we make introduce a new layer of abstraction, we were concerned about the potential increase of memory access latency. Therefore, we investigated the assembly code generated by the output of status quo wasm2c and a toy model of the expected output of our new version of wasm2c in terms of memory access. The assembly code is compiled with clang -O2. Both versions require 2 memory access for a memory store.

Status quo wasm2c

Input wat file:

(module
  (memory $m1 1)
  (func (export "store") (param i32 i32)
     local.get 0
     local.get 1
     i32.store))

Assembly code compiled from wasm2c output:

0000000000401240 <w2c_store>:
static void w2c_store(u32 w2c_p0, u32 w2c_p1) {
  401240: 50                    push   %rax
  FUNC_PROLOGUE;
  401241: 8b 05 81 2e 00 00     mov    0x2e81(%rip),%eax        # 4040c8 <wasm_rt_call_stack_depth>
  401247: 83 c0 01              add    $0x1,%eax
  40124a: 89 05 78 2e 00 00     mov    %eax,0x2e78(%rip)        # 4040c8 <wasm_rt_call_stack_depth>
  401250: 3d f5 01 00 00        cmp    $0x1f5,%eax
  401255: 73 15                 jae    40126c <w2c_store+0x2c>
  i32_store((&w2c_M0), (u64)(w2c_i0), w2c_i1);
  401257: 89 f8                 mov    %edi,%eax
DEFINE_STORE(i32_store, u32, u32);
  401259: 48 8b 0d 40 2e 00 00  mov    0x2e40(%rip),%rcx        # 4040a0 <w2c_M0>
  401260: 89 34 01              mov    %esi,(%rcx,%rax,1)
  FUNC_EPILOGUE;
  401263: 83 05 5e 2e 00 00 ff  addl   $0xffffffff,0x2e5e(%rip)        # 4040c8 <wasm_rt_call_stack_depth>
}
  40126a: 58                    pop    %rax
  40126b: c3                    ret
  FUNC_PROLOGUE;
  40126c: bf 07 00 00 00        mov    $0x7,%edi
  401271: e8 0a 00 00 00        call   401280 <wasm_rt_trap>
  401276: 66 2e 0f 1f 84 00 00  cs nopw 0x0(%rax,%rax,1)
  40127d: 00 00 00

There are two memory access for the actual i32_store, 401259: 48 8b 0d 40 2e 00 00 mov 0x2e40(%rip),%rcx for storing the address of w2c_M0->data to a register and 401260: 89 34 01 mov %esi,(%rcx,%rax,1) for the actual storing.

Toy model of the expected output after our changes

program.cc

#include <stdint.h>

typedef struct {
  uint8_t* data;
  uint32_t size;
} wasm_rt_memory_t;

typedef struct {
  wasm_rt_memory_t memoryone;
  wasm_rt_memory_t memorytwo;
} memories;

void i8_store(memories* mem, uint8_t val, uint32_t idx);
uint8_t i8_load(memories* mem, uint32_t idx);
int start(memories* mem);

void i8_store(memories* mem, uint8_t val, uint32_t idx) {
  mem->memorytwo.data[idx] = val;
  return;
}

uint8_t i8_load(memories* mem, uint32_t idx) {
  return mem->memorytwo.data[idx];
}

int start(memories* mem) {
  i8_store(mem, 42, 8);
  return i8_load(mem, 8);
}

main.cc

#include <stdint.h>

typedef struct {
  uint8_t* data;
  uint32_t size;
} wasm_rt_memory_t;

typedef struct {
  wasm_rt_memory_t memoryone;
  wasm_rt_memory_t memorytwo;
} memories;

static memories global_mem;

int main() {
  return start(&global_mem);
}

and the compiled assembly code:

void i8_store(memories* mem, uint8_t val, uint32_t idx) {
  mem->memorytwo.data[idx] = val;
   0: 48 8b 47 10           mov    0x10(%rdi),%rax
   4: 89 d1                 mov    %edx,%ecx
   6: 40 88 34 08           mov    %sil,(%rax,%rcx,1)
  return;
   a: c3                    ret
   b: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)

There are also two memory access, 0: 48 8b 47 10 mov 0x10(%rdi),%rax for getting mem->memorytwo.data and 6: 40 88 34 08 mov %sil,(%rax,%rcx,1) for the actual store.

Both the status quo and our version of wasm2c has the same memory access overhead on the assembly code level. It would be nice if someone could share their thoughts on the memory access pattern of wasm2c and also the changes we are planning to make.

@kripken
Copy link
Member

kripken commented Jan 20, 2022

I believe that the "per-instance" part of #1721 should cover this. We expect to see separate PRs for those, including one for instancing.

cc @shravanrn

@kripken
Copy link
Member

kripken commented Jan 20, 2022

Oh, actually I see you linked to #1721 already but with a @ and not # so I missed it, sorry...

About the overhead, I am not very worried. I'd expect an LTO build to be able to recover any losses. I could be wrong though...

@shravanrn
Copy link
Collaborator

Hi,
Yup, #1721 should cover multiple instances. Apologies for the delay in landing that PR. The upside is that PR has now been production tested for over a month across windows, Mac, Linux and android in the Firefox browser and works well.

I am hoping to make more progress landing that commit early February and will keep you posted.

Regarding your question re multiple memory accesses. I agree with alon that this isn't a huge issue as this extra load to the the "data" field wouldn't happen every time --- at most a few times per function. So net performance difference wouldn't be that big.

More concretely, this is a standard problem with sandboxing. (You may already be familiar with this, so I apologize if this is a repeat). In short, the only way to avoid the multiple memory safe accesses for an AOT multiple instance sandboxing compiler is to reserve a register which exclusively holds the value for data (the register cannot be used for any other computation).

The expectation is that pinned registers would buy a few percent of speedup in code with a lot of memory accesses(I think cranelift/wasmtime has some benchmarks in their compiler that show this), but this change would slow down CPU bottlenecked code due to the added register pressure. So depending on computation, this may or may not be something you'd want.

More to the point for wasm2c, there is no C compliant way to pin registers (the register keyword in C does not do the right thing for this). You'd have to mess with the core of the compiler to get the right behavior for pinned registers.

Given these factors, my own take is that the current wasm2c approach is the right tradeoff.

@yhdengh
Copy link
Contributor Author

yhdengh commented Jan 26, 2022

It sounds like we're on the same page. PR submitted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants