Thoughts on making wasm2c thread-safe #1805

yhdengh · 2022-01-13T21:11:55Z

Hello all,

I would like to start a conversation about the best way to make wasm2c thread-safe so that the same module can be instantiated multiple times with different memory contexts, and functions from different instances can be executed in parallel. Our current plan was to move global memory variables into a structure and then pass a pointer to that structure as a parameter to every function call (Similar changes are also proposed by #1721). Before we draft a PR, I would like to see what are people’s thoughts about this. More details are below.

Since the changes we make introduce a new layer of abstraction, we were concerned about the potential increase of memory access latency. Therefore, we investigated the assembly code generated by the output of status quo wasm2c and a toy model of the expected output of our new version of wasm2c in terms of memory access. The assembly code is compiled with clang -O2. Both versions require 2 memory access for a memory store.

Status quo wasm2c

Input wat file:

(module
  (memory $m1 1)
  (func (export "store") (param i32 i32)
     local.get 0
     local.get 1
     i32.store))

Assembly code compiled from wasm2c output:

0000000000401240 <w2c_store>:
static void w2c_store(u32 w2c_p0, u32 w2c_p1) {
  401240: 50                    push   %rax
  FUNC_PROLOGUE;
  401241: 8b 05 81 2e 00 00     mov    0x2e81(%rip),%eax        # 4040c8 <wasm_rt_call_stack_depth>
  401247: 83 c0 01              add    $0x1,%eax
  40124a: 89 05 78 2e 00 00     mov    %eax,0x2e78(%rip)        # 4040c8 <wasm_rt_call_stack_depth>
  401250: 3d f5 01 00 00        cmp    $0x1f5,%eax
  401255: 73 15                 jae    40126c <w2c_store+0x2c>
  i32_store((&w2c_M0), (u64)(w2c_i0), w2c_i1);
  401257: 89 f8                 mov    %edi,%eax
DEFINE_STORE(i32_store, u32, u32);
  401259: 48 8b 0d 40 2e 00 00  mov    0x2e40(%rip),%rcx        # 4040a0 <w2c_M0>
  401260: 89 34 01              mov    %esi,(%rcx,%rax,1)
  FUNC_EPILOGUE;
  401263: 83 05 5e 2e 00 00 ff  addl   $0xffffffff,0x2e5e(%rip)        # 4040c8 <wasm_rt_call_stack_depth>
}
  40126a: 58                    pop    %rax
  40126b: c3                    ret
  FUNC_PROLOGUE;
  40126c: bf 07 00 00 00        mov    $0x7,%edi
  401271: e8 0a 00 00 00        call   401280 <wasm_rt_trap>
  401276: 66 2e 0f 1f 84 00 00  cs nopw 0x0(%rax,%rax,1)
  40127d: 00 00 00

There are two memory access for the actual i32_store, 401259: 48 8b 0d 40 2e 00 00 mov 0x2e40(%rip),%rcx for storing the address of w2c_M0->data to a register and 401260: 89 34 01 mov %esi,(%rcx,%rax,1) for the actual storing.

Toy model of the expected output after our changes

program.cc

#include <stdint.h>

typedef struct {
  uint8_t* data;
  uint32_t size;
} wasm_rt_memory_t;

typedef struct {
  wasm_rt_memory_t memoryone;
  wasm_rt_memory_t memorytwo;
} memories;

void i8_store(memories* mem, uint8_t val, uint32_t idx);
uint8_t i8_load(memories* mem, uint32_t idx);
int start(memories* mem);

void i8_store(memories* mem, uint8_t val, uint32_t idx) {
  mem->memorytwo.data[idx] = val;
  return;
}

uint8_t i8_load(memories* mem, uint32_t idx) {
  return mem->memorytwo.data[idx];
}

int start(memories* mem) {
  i8_store(mem, 42, 8);
  return i8_load(mem, 8);
}

main.cc

#include <stdint.h>

typedef struct {
  uint8_t* data;
  uint32_t size;
} wasm_rt_memory_t;

typedef struct {
  wasm_rt_memory_t memoryone;
  wasm_rt_memory_t memorytwo;
} memories;

static memories global_mem;

int main() {
  return start(&global_mem);
}

and the compiled assembly code:

void i8_store(memories* mem, uint8_t val, uint32_t idx) {
  mem->memorytwo.data[idx] = val;
   0: 48 8b 47 10           mov    0x10(%rdi),%rax
   4: 89 d1                 mov    %edx,%ecx
   6: 40 88 34 08           mov    %sil,(%rax,%rcx,1)
  return;
   a: c3                    ret
   b: 0f 1f 44 00 00        nopl   0x0(%rax,%rax,1)

There are also two memory access, 0: 48 8b 47 10 mov 0x10(%rdi),%rax for getting mem->memorytwo.data and 6: 40 88 34 08 mov %sil,(%rax,%rcx,1) for the actual store.

Both the status quo and our version of wasm2c has the same memory access overhead on the assembly code level. It would be nice if someone could share their thoughts on the memory access pattern of wasm2c and also the changes we are planning to make.

The text was updated successfully, but these errors were encountered:

kripken · 2022-01-20T22:27:56Z

I believe that the "per-instance" part of #1721 should cover this. We expect to see separate PRs for those, including one for instancing.

cc @shravanrn

kripken · 2022-01-20T22:30:16Z

Oh, actually I see you linked to #1721 already but with a @ and not # so I missed it, sorry...

About the overhead, I am not very worried. I'd expect an LTO build to be able to recover any losses. I could be wrong though...

shravanrn · 2022-01-21T05:25:33Z

Hi,
Yup, #1721 should cover multiple instances. Apologies for the delay in landing that PR. The upside is that PR has now been production tested for over a month across windows, Mac, Linux and android in the Firefox browser and works well.

I am hoping to make more progress landing that commit early February and will keep you posted.

Regarding your question re multiple memory accesses. I agree with alon that this isn't a huge issue as this extra load to the the "data" field wouldn't happen every time --- at most a few times per function. So net performance difference wouldn't be that big.

More concretely, this is a standard problem with sandboxing. (You may already be familiar with this, so I apologize if this is a repeat). In short, the only way to avoid the multiple memory safe accesses for an AOT multiple instance sandboxing compiler is to reserve a register which exclusively holds the value for data (the register cannot be used for any other computation).

The expectation is that pinned registers would buy a few percent of speedup in code with a lot of memory accesses(I think cranelift/wasmtime has some benchmarks in their compiler that show this), but this change would slow down CPU bottlenecked code due to the added register pressure. So depending on computation, this may or may not be something you'd want.

More to the point for wasm2c, there is no C compliant way to pin registers (the register keyword in C does not do the right thing for this). You'd have to mess with the core of the compiler to get the right behavior for pinned registers.

Given these factors, my own take is that the current wasm2c approach is the right tradeoff.

yhdengh · 2022-01-26T01:05:29Z

It sounds like we're on the same page. PR submitted.

yhdengh mentioned this issue Jan 26, 2022

wasm2c: add support for module instancing #1814

Merged

yhdengh closed this as completed Jan 26, 2022

keithw mentioned this issue Mar 6, 2024

wasm2c: Allow use of x86 segment register for linear memory access #2395

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thoughts on making wasm2c thread-safe #1805

Thoughts on making wasm2c thread-safe #1805

yhdengh commented Jan 13, 2022 •

edited by keithw

Loading

kripken commented Jan 20, 2022

kripken commented Jan 20, 2022

shravanrn commented Jan 21, 2022

yhdengh commented Jan 26, 2022

Thoughts on making wasm2c thread-safe #1805

Thoughts on making wasm2c thread-safe #1805

Comments

yhdengh commented Jan 13, 2022 • edited by keithw Loading

Status quo wasm2c

Toy model of the expected output after our changes

kripken commented Jan 20, 2022

kripken commented Jan 20, 2022

shravanrn commented Jan 21, 2022

yhdengh commented Jan 26, 2022

yhdengh commented Jan 13, 2022 •

edited by keithw

Loading