
cfe/SCH deadlocks on exit on Linux #701

Closed
excaliburtb opened this issue May 13, 2020 · 9 comments · Fixed by nasa/osal#472
@excaliburtb

Using modules:
95f34d2 cfe
c2bcebbc4d7e60a41b604e9acfc8af3c60b8536a osal
37ee8eb2d7ce006dc1570b920ae75a7ac5f89d27 psp

There seems to be a deadlock on exit involving the timers used by SCH.

See the stack trace below; a sketch of the lock pattern it suggests follows the trace.

Thread 2 (Thread 0xef3ffb40 (LWP 19797)):
#0  0xf7766430 in __kernel_vsyscall ()
#1  0xf773e436 in __pause_nocancel () from /lib/libpthread.so.0
#2  0xf7734995 in __pthread_mutex_lock_full () from /lib/libpthread.so.0
#3  0x0807bbbe in OS_BinSemGive_Impl (sem_id=4) at /home/tbrain/cert_testbed/osal/src/os/posix/src/os-impl-binsem.c:250
#4  0x0807558c in OS_BinSemGive (sem_id=262148) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-binsem.c:187
#5  0xf7750628 in SCH_MinorFrameCallback (TimerId=589826) at /home/tbrain/cert_testbed/apps/sch_g/fsw/src/sch_custom.c:442
#6  0x0807b3a8 in OS_Timer_NoArgCallback (objid=589826, arg=0xf77503fe <SCH_MinorFrameCallback>) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-time.c:227
#7  0x0807b072 in OS_TimeBase_CallbackThread (timebase_id=524290) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-timebase.c:526
#8  0x0807df44 in OS_TimeBasePthreadEntry (arg=0x80002) at /home/tbrain/cert_testbed/osal/src/os/posix/src/os-impl-timebase.c:305
#9  0xf7736bbc in start_thread () from /lib/libpthread.so.0
#10 0xf76550de in clone () from /lib/libc.so.6

Thread 1 (Thread 0xf7555700 (LWP 19780)):
#0  0xf7766430 in __kernel_vsyscall ()
#1  0xf773497f in __pthread_mutex_lock_full () from /lib/libpthread.so.0
#2  0x0807dc46 in OS_TimeBaseLock_Impl (local_id=2) at /home/tbrain/cert_testbed/osal/src/os/posix/src/os-impl-timebase.c:108
#3  0x0807b63a in OS_TimerDelete (timer_id=589826) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-time.c:422
#4  0x08075ab8 in OS_CleanUpObject (object_id=589826, arg=0xffc049e8) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-common.c:263
#5  0x08078877 in OS_ForEachObject (creator_id=0, callback_ptr=0x8075a1c <OS_CleanUpObject>, callback_arg=0xffc049e8) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-idmap.c:1015
#6  0x08075b0a in OS_DeleteAllObjects () at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-common.c:299
#7  0x08074ebe in OS_Application_Run () at /home/tbrain/cert_testbed/psp/fsw/pc-linux/src/cfe_psp_start.c:458
#8  0x080801d1 in main (argc=1, argv=0xffc04b64) at /home/tbrain/cert_testbed/osal/src/bsp/pc-linux/src/bsp_start.c:198
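
For illustration, here is a minimal standalone sketch of the two-lock pattern the trace suggests. The names timebase_lock and sem_lock are hypothetical stand-ins for the OSAL timebase mutex and the mutex guarding the binary semaphore; this is a reduction of the trace, not actual OSAL code, and it assumes sem_lock was left locked by a task that was force-deleted during shutdown.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t timebase_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sem_lock      = PTHREAD_MUTEX_INITIALIZER;

/* Thread 2 (the timebase callback thread): takes timebase_lock, then blocks
 * forever on sem_lock inside the SCH callback's semaphore give. */
static void *callback_thread(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&timebase_lock);
    pthread_mutex_lock(&sem_lock);       /* never returns */
    return NULL;
}

/* Thread 1 (main, in OS_DeleteAllObjects -> OS_TimerDelete): blocks forever
 * on timebase_lock, which the callback thread still holds. */
int main(void)
{
    pthread_t cb;

    pthread_mutex_lock(&sem_lock);       /* simulate the lock left behind by a deleted task */
    pthread_create(&cb, NULL, callback_thread, NULL);
    sleep(1);                            /* let the callback thread take timebase_lock */
    pthread_mutex_lock(&timebase_lock);  /* deadlock */
    return 0;
}
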
@skliper (Contributor) commented May 13, 2020

Is this just an order thing? Shouldn't applications get deleted before the timers?

EDIT - I see what you were saying now; the callback needs to get unregistered.

@jphickey (Contributor)

Is this reproducible, or is it a race condition during shutdown? If a thread is canceled while it is holding a lock, this type of thing can happen. That's the risk with any sort of forced-exit situation, which is why it's preferable to get tasks to self-shutdown rather than forcibly delete them.
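For reference, the cooperative-shutdown pattern this refers to is the standard cFE app run loop, in which the task polls its run status and exits on its own instead of being forcibly deleted. A rough sketch (SCH's actual main loop differs, and the cleanup shown is illustrative):

#include "cfe.h"

/* Sketch of a self-shutting-down app main loop. */
void SCH_AppMain(void)
{
    uint32 RunStatus = CFE_ES_RunStatus_APP_RUN;

    /* ... initialization: create semaphores, timers, etc. ... */

    while (CFE_ES_RunLoop(&RunStatus) == true)
    {
        /* wait on the minor-frame semaphore, process the schedule table */
    }

    /* the app tears down its own timers/semaphores here, so nothing is
     * force-deleted while a lock is still held */
    CFE_ES_ExitApp(RunStatus);
}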

@excaliburtb (Author)

Right. As far as I can tell, Ctrl-C'ing the process immediately kills the apps, which prevents them from doing any clean shutdown, so cFE has to do the cleanup. However, this behavior hasn't been a problem for the SCH code base for many versions of cFE. The questions are: what changed, what should the app do, and what should cFE/OSAL/PSP do?

@excaliburtb (Author)

This is an intermittent problem, but it occurs often enough that hitting it is not rare.

@jphickey (Contributor)

> However, this behavior hasn't been a problem for the SCH code base for many versions of cFE

Is this to say you are finding this occurs more frequently in the latest baseline vs. older baselines?

If I'm interpreting correctly, you are running the latest bleeding-edge baseline, which changed the CTRL+C handling so it is treated as an exception and thereby flows through the ER log/processor reset sequence. This still does a forced delete of all tasks, but it may change the timing of when that occurs, and possibly the order of operations. That would have changed only in the most recent baseline, though.

@excaliburtb (Author)

As far as I know, it never occurred in the older baselines. And yes, I am working with the bleeding-edge master branches (see the initial comment for hashes).

@skliper added the bug label May 16, 2020
@jphickey self-assigned this May 18, 2020
@jphickey (Contributor)

I am looking into this one, but I have been unable to replicate the issue, as I'm not sure which version/config of SCH is in use here. However, it could simply be that OS_ForEachObject, which drives the cleanup operations, finds the tasks and semaphores before the timers.
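One way to picture the ordering concern: if cleanup walked objects grouped by type, the timer callbacks and time bases could be torn down before the tasks and semaphores their callbacks touch. A purely hypothetical sketch (delete_all_of_type() is an illustrative helper, not an OSAL API; the object-type constants are real OSAL identifiers):

#include "osapi.h"

/* Hypothetical type-ordered cleanup; not the actual OS_DeleteAllObjects logic. */
static void delete_all_of_type(uint32 objtype); /* assumed helper */

void cleanup_in_dependency_order(void)
{
    delete_all_of_type(OS_OBJECT_TYPE_OS_TIMECB);   /* timer callbacks first  */
    delete_all_of_type(OS_OBJECT_TYPE_OS_TIMEBASE); /* then their time bases  */
    delete_all_of_type(OS_OBJECT_TYPE_OS_TASK);     /* then application tasks */
    delete_all_of_type(OS_OBJECT_TYPE_OS_BINSEM);   /* then sync objects      */
    /* ... remaining object types ... */
}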

@jphickey (Contributor)

@excaliburtb Is the backtrace posted in the initial summary showing every thread that still existed in the process or just the ones that were "stuck"?

In particular, I'm wondering about the task that runs SCH_AppMain, which is not shown above. It would normally be inside a pthread_cond_wait() call, but it may have been woken up by the SIGINT. My hypothesis is that it got woken up but was deleted before it could release the lock.
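To make that hypothesis concrete, here is a minimal standalone sketch (illustrative only, not OSAL code) of how cancelling a thread that is blocked in pthread_cond_wait() leaves the associated mutex locked: per POSIX, the mutex is re-acquired before cancellation cleanup runs, so without a cleanup handler that unlocks it, the thread terminates holding the lock and every later lock attempt hangs.

#include <pthread.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *waiter(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    /* cancellation point: on cancel, the mutex is re-acquired and the thread
     * then terminates still holding it (no cleanup handler is pushed) */
    pthread_cond_wait(&cond, &lock);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, waiter, NULL);
    sleep(1);                   /* let the waiter block in pthread_cond_wait() */
    pthread_cancel(tid);        /* analogous to force-deleting the task */
    pthread_join(tid, NULL);
    pthread_mutex_lock(&lock);  /* hangs forever, like the timer callback in the trace */
    return 0;
}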

@skliper (Contributor) commented Jun 5, 2020

Resolved by nasa/osal#470
