Race in worker_loop #994

talex5 · 2023-07-11T14:52:51Z

worker_loop starts by masking signals, since if a signal occurs in this thread it will crash the process:

Lines 968 to 977 in 90e0d6e

 static void *worker_loop(void *data) { 

 lwt_unix_job job = (lwt_unix_job)data; 

 #if defined(HAVE_PTHREAD) 

 /* Block all signals, otherwise ocaml handlers defined with the 

  module Sys may be executed in this thread, oops... */ 

 sigset_t mask; 

 sigfillset(&mask); 

 pthread_sigmask(SIG_SETMASK, &mask, NULL); 

 #endif

However, there is a slight chance that a signal will occur before this. Here's a test case:

let pid = Unix.getpid ()

let some_worker_job () =
  Lwt_unix.gethostbyname "www.google.com"

let _ =
  Sys.(set_signal sigusr1) @@ Signal_handle (fun _ -> print_endline "Got USR1");
  Lwt_main.run begin
    let job = some_worker_job () in
    print_endline "Mask USR1 in main thread to make sure worker gets it";
    ignore (Thread.(sigmask SIG_BLOCK) [Sys.sigusr1] : _ list);
    print_endline "Sending USR1";
    assert (Unix.system ("kill -USR1 " ^ string_of_int pid) = Unix.WEXITED 0);
    job
  end

If I add a small delay then it reliably segfaults:

static void *worker_loop(void *data) {
  lwt_unix_job job = (lwt_unix_job)data;

  for (int64_t i = 1; i < 10000000; i++) {
    __asm__ __volatile__("");
  }

  ...

Starting new thread                    
Mask USR1 in main thread to make sure worker gets it
Sending USR1
fish: Job 1, './_build/default/test/unix/sign…' terminated by signal SIGSEGV (Address boundary error)

I spotted this while trying to track down some segfaults in my program. I'm not sure if this is the cause of those, as the race looks hard to hit without the delay.

Using pthread_attr_setsigmask_np to set an initial mask might fix it.

The text was updated successfully, but these errors were encountered:

talex5 · 2023-07-11T15:40:32Z

After some more checks, I can confirm this bug is the cause of the crash I saw. Here's a gdb session with a coredump from a real crash:

(gdb) bt
#0  0x000055abbb34cc69 in interrupt_domain (
    s=<error reading variable: Asked for position 0 of stack, stack only has 0 elements on it.>)
    at runtime/domain.c:297
#1  caml_interrupt_self () at runtime/domain.c:1521
#2  0x000055abbb36636e in caml_record_signal (signal_number=17) at runtime/signals.c:123
#3  handle_signal (signal_number=17) at runtime/signals.c:587
#4  <signal handler called>
#5  start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:432
#6  0x00007f427801b5bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

(gdb) fr 5
#5  start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:432
432	      if (pd->c11)

(gdb) disass
=> 0x00007f4277f9ae26 <+278>:	cmpb   $0x0,0x8f8(%rbx)

(gdb) info registers rbx
rbx            0x7f42678a76c0      139923181696704

(gdb) p ((struct pthread *) 0x7f42678a76c0)->start_routine 
$7 = (void *(*)(void *)) 0x55abbb33af10 <worker_loop>

So it was just about to start worker_loop when the signal happened.

talex5 mentioned this issue Jul 11, 2023

Make signal handling work on OCaml 5 and with Eio #993

Merged

talex5 mentioned this issue Jul 11, 2023

Initial OCaml 5 / Eio port ocurrent/solver-service#71

Merged

raphael-proust closed this as completed in #993 Jul 17, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race in worker_loop #994

Race in worker_loop #994

talex5 commented Jul 11, 2023

talex5 commented Jul 11, 2023

Race in worker_loop #994

Race in worker_loop #994

Comments

talex5 commented Jul 11, 2023

talex5 commented Jul 11, 2023