Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race in worker_loop #994

Closed
talex5 opened this issue Jul 11, 2023 · 1 comment · Fixed by #993
Closed

Race in worker_loop #994

talex5 opened this issue Jul 11, 2023 · 1 comment · Fixed by #993

Comments

@talex5
Copy link
Contributor

talex5 commented Jul 11, 2023

worker_loop starts by masking signals, since if a signal occurs in this thread it will crash the process:

static void *worker_loop(void *data) {
lwt_unix_job job = (lwt_unix_job)data;
#if defined(HAVE_PTHREAD)
/* Block all signals, otherwise ocaml handlers defined with the
module Sys may be executed in this thread, oops... */
sigset_t mask;
sigfillset(&mask);
pthread_sigmask(SIG_SETMASK, &mask, NULL);
#endif

However, there is a slight chance that a signal will occur before this. Here's a test case:

let pid = Unix.getpid ()

let some_worker_job () =
  Lwt_unix.gethostbyname "www.google.com"

let _ =
  Sys.(set_signal sigusr1) @@ Signal_handle (fun _ -> print_endline "Got USR1");
  Lwt_main.run begin
    let job = some_worker_job () in
    print_endline "Mask USR1 in main thread to make sure worker gets it";
    ignore (Thread.(sigmask SIG_BLOCK) [Sys.sigusr1] : _ list);
    print_endline "Sending USR1";
    assert (Unix.system ("kill -USR1 " ^ string_of_int pid) = Unix.WEXITED 0);
    job
  end

If I add a small delay then it reliably segfaults:

static void *worker_loop(void *data) {
  lwt_unix_job job = (lwt_unix_job)data;

  for (int64_t i = 1; i < 10000000; i++) {
    __asm__ __volatile__("");
  }

  ...
Starting new thread                    
Mask USR1 in main thread to make sure worker gets it
Sending USR1
fish: Job 1, './_build/default/test/unix/sign…' terminated by signal SIGSEGV (Address boundary error)

I spotted this while trying to track down some segfaults in my program. I'm not sure if this is the cause of those, as the race looks hard to hit without the delay.

Using pthread_attr_setsigmask_np to set an initial mask might fix it.

@talex5
Copy link
Contributor Author

talex5 commented Jul 11, 2023

After some more checks, I can confirm this bug is the cause of the crash I saw. Here's a gdb session with a coredump from a real crash:

(gdb) bt
#0  0x000055abbb34cc69 in interrupt_domain (
    s=<error reading variable: Asked for position 0 of stack, stack only has 0 elements on it.>)
    at runtime/domain.c:297
#1  caml_interrupt_self () at runtime/domain.c:1521
#2  0x000055abbb36636e in caml_record_signal (signal_number=17) at runtime/signals.c:123
#3  handle_signal (signal_number=17) at runtime/signals.c:587
#4  <signal handler called>
#5  start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:432
#6  0x00007f427801b5bc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

(gdb) fr 5
#5  start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:432
432	      if (pd->c11)

(gdb) disass
=> 0x00007f4277f9ae26 <+278>:	cmpb   $0x0,0x8f8(%rbx)

(gdb) info registers rbx
rbx            0x7f42678a76c0      139923181696704

(gdb) p ((struct pthread *) 0x7f42678a76c0)->start_routine 
$7 = (void *(*)(void *)) 0x55abbb33af10 <worker_loop>

So it was just about to start worker_loop when the signal happened.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant