Experiencing possible race condition with pthread_cond_timedwait. #9840
Comments
@g2gps did you try it on a more stable arch like STM32 or ESP32?
Also, it would be good to check whether disabling priority inheritance changes the reproduction rate.
Thanks for the feedback. Enabling or disabling priority inheritance doesn't have a noticeable effect on the time until failure. I did try building and running on an
Same result. After a period of time, the signal is missed.
This is an issue with our application logic and not NuttX. What's happening is the
I'm using a combination of condition variables and mutexes for thread synchronization on a shared resource. I seem to be experiencing a race condition in which a signal on a condition variable is missed, which results in the application logic deadlocking, with every thread waiting on my interlocking mutexes.
I've attempted to extract the problematic logic out of our existing application into a minimal reproducible example, which still encounters the issue:
Here are what I believe to be the significant parts of my defconfig:
Here's a sample of the syslog for this program, around the point where the signal seems to be missed.
```
[ 6] [ INFO] pthread_cond_clockwait: cond=0x10401038 mutex=0x1040101c abstime=0x1080bf78
[ 6] [ INFO] pthread_cond_clockwait: Give up mutex...
[13] [ INFO] trigger_generator: trigger start, core -1, ser -1
[13] [ INFO] trigger_start: Lock serialise: core -1, ser -1
[13] [ INFO] pthread_mutex_timedlock: mutex=0x10401000
[13] [ INFO] pthread_mutex_timedlock: Returning 0
[13] [ INFO] trigger_start: Lock core: core -1, ser 13
[13] [ INFO] pthread_mutex_timedlock: mutex=0x1040101c
[13] [ INFO] pthread_mutex_timedlock: Returning 0
[13] [ INFO] trigger_start: Signal on cmd: core 13, ser 13
[13] [ INFO] pthread_cond_signal: cond=0x10401038
[13] [ INFO] pthread_cond_signal: sval=-1
[13] [ INFO] pthread_cond_signal: Signalling...
[13] [ INFO] pthread_cond_signal: Returning 0
[13] [ INFO] trigger_start: Wait on result: core 13, ser 13
[13] [ INFO] pthread_cond_wait: cond=0x1040104c mutex=0x1040101c
[13] [ INFO] pthread_cond_wait: Give up mutex / take cond
[ 6] [ INFO] pthread_cond_clockwait: Re-locking...
[ 6] [ INFO] pthread_cond_clockwait: Returning 110
[ 6] [ INFO] worker: timed wait rd 110, core 6, ser 13
[ 6] [ INFO] worker: Finished work: core 6, ser 13
[ 6] [ INFO] worker: Wait on cmd: core 6, ser 13
```
From this output: at this point, I would have thought `pthread_cond_timedwait` would return 0, since the signal did occur. However, it seems that the signal is delivered either before or after the semaphore (in the command condition variable) is waited on.

There are a few things I've experimented with to try and narrow down the issue:

- Changing `pthread_cond_timedwait` to `pthread_cond_wait`. -> Fixes the issue I'm seeing. However, our application requires the use of timedwait here.
- Disabling `CONFIG_DEBUG_FEATURES` to remove syslog output. -> Same issue, however it happens less often.
- Running on the `qemu-rv32` port. -> Haven't been able to reproduce the issue, although I haven't run it for extended durations.

There are a few comments I've found while looking through the source which make me think an assumption is made that the `clock_id` time source is stopped while interrupts are disabled, which may not be the case on all platforms.

From `sched/semaphore/sem_clockwait.c`:

From `sched/pthread/pthread_condclockwait.c`:

If that is the case, I could see how the absolute time given to `pthread_cond_timedwait` may have expired before `nxsem_wait` is called in `nxsem_clockwait`. However, I would have thought that the scheduler is locked between when the mutex is released and the wait is started, so this shouldn't have any effect.

I've spent a bit of time trying to understand the interactions, but I don't believe I have a full enough understanding of the implementation in NuttX to make any feasible progress. At this point I'm not sure if it's an issue with the `Litex` port, the kernel configuration, or an oversight in our application.

I'd appreciate any assistance tracking down the issue I'm seeing.