Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Assertion failed at file:tls/tls_getset.c line: 59 #3868

Closed
yamt opened this issue Jun 8, 2021 · 12 comments · Fixed by #3877
Closed

Assertion failed at file:tls/tls_getset.c line: 59 #3868

yamt opened this issue Jun 8, 2021 · 12 comments · Fixed by #3877

Comments

@yamt
Copy link
Contributor

yamt commented Jun 8, 2021

i occasionally get an assertion failure.

up_assert: Assertion failed at file:tls/tls_getset.c line: 59 task: pt-0x400edfd0

a backtrace:

0x400d5d87: xtensa_assert at xtensa_assert.c:93
0x400d5de3: up_assert at xtensa_assert.c:169 (discriminator 2)
0x400d3f89: _assert at lib_assert.c:36
0x400d5022: tls_get_set at tls_getset.c:59 (discriminator 1)
0x400d4fb4: tls_destruct at tls_destruct.c:63 (discriminator 2)
0x400d40d7: pthread_exit at pthread_exit.c:61
0x400ee088: iperf_report_task at iperf.c:231
0x400e94c1: pthread_startup at pthread_create.c:59 (discriminator 2)
0x400f25ab: pthread_start at pthread_create.c:191 (discriminator 4)

i guess that a detached thread is exiting after the main thread exited.

@xiaoxiang781216
Copy link
Contributor

It's a problem if the background thread is still running but the main thread exit.

@yamt
Copy link
Contributor Author

yamt commented Jun 8, 2021

@xiaoxiang781216 it's my understanding too. do you have an idea how to fix it?

@xiaoxiang781216
Copy link
Contributor

xiaoxiang781216 commented Jun 8, 2021

@patacongo and @no1wudi do you have any idea for this especial case?

@patacongo
Copy link
Contributor

patacongo commented Jun 8, 2021

it does look like the main thread has exited or that the tg_pid is not valid for some reason. The tg_pid is set when the task group is create and is never changed until the task group is de-allocated. So until there is a wild write, I would expect the tg_pid to be valid. The more likely thing is that the main thread has exited.

That is really very likely under circumstances. pthreads are not really canceled on exit() in deferred cancellation mode: They may only be marked as deleted and will continue to run until the thread calls a cancellation point. That could be some time later. Meanwhile, the main thread could exit couldn't it.

  1. Main thread calls exit()
  2. All pthreads are canceled, but one or more persists because they are in deferred cancellation mode
  3. pthread exits and attempts to call data destructors
  4. That fails and assertion occurs because the main thread has already exited.

Or if a pthread is waiting on a semaphore and cancelled, but the OS logic ignores the ECANCELED error, then the pthread will never exit will never exit.

In either case, in these states tg_pid will be valid, but nxsched_get_tcb() will fail. Perhaps there is some kind of protection that I do not see, but I think this could happen in normal operation, couldn't it? If so, then one solution would be to remove the assertion.

I suspect that there are other possibilities for race conditions. So example, I think this could cause a similar problem:

What happens when a pthread calls exit()? In that case, it looks like the main thread could be killed before that pthread is killed. group_kill_children() will kill the main thread and the pthread will be the last thread to exit (via pthread_exit). Then when that pthread calls its destructors, the main thread would not exist the assertion would occur.

  1. pthread exits
  2. All threads killed or cancelled except for the calling pthread
  3. Calling pthread exits and attempts to call data destructors
  4. Assertion occurs because the main thread was killed in step 2.

I am just speculating about ways that the ordering could change to to race conditions. I don't know if any of the above are truly possible. If any of these race conditions do occur than it is likely that the persisting pthreads could be operating on stale memory since, most likely, the group structure will have been freed.

@no1wudi
Copy link
Contributor

no1wudi commented Jun 9, 2021

@yamt Could you have a test again without #3858 (revert it) ?

Before #3858, TLS destructors stored in the task_group_s, it would be freed when all thread exit (by group_leave).

How about keep the main thread stack alive after all thread exit like task_group_s?

yamt added a commit to yamt/incubator-nuttx that referenced this issue Jun 9, 2021
This reverts commit cc514d7.

* It introduced a regression.
  apache#3868

* It seems conceptually wrong to have per-process data in
  the main thread's stack.
yamt added a commit to yamt/incubator-nuttx that referenced this issue Jun 9, 2021
This reverts commit cc514d7.

* It introduced a regression.
  apache#3868

* It seems conceptually wrong to have per-process data in
  the main thread's stack.
@yamt
Copy link
Contributor Author

yamt commented Jun 9, 2021

@yamt Could you have a test again without #3858 (revert it) ?

Before #3858, TLS destructors stored in the task_group_s, it would be freed when all thread exit (by group_leave).

thank you for the pointer.
i tested and submitted a revert.
#3877

How about keep the main thread stack alive after all thread exit like task_group_s?

i suspect it's better to introduce a real per-task-group structure than keeping abusing main thread.
how do you think?

xiaoxiang781216 pushed a commit that referenced this issue Jun 9, 2021
This reverts commit cc514d7.

* It introduced a regression.
  #3868

* It seems conceptually wrong to have per-process data in
  the main thread's stack.
@xiaoxiang781216 xiaoxiang781216 linked a pull request Jun 9, 2021 that will close this issue
@patacongo
Copy link
Contributor

patacongo commented Jun 9, 2021

i suspect it's better to introduce a real per-task-group structure than keeping abusing main thread.
how do you think?

As I mention above, I think that the root case is the main can exit before the last pthread exits in certain race conditions. So I think that that correct solution is to add a reference count so that the main thread data persists until the last thread exits.

This logic used to exist in the past. The task group persists until the last thread exited as determined by a reference count. That was lost when we (1) corrected the behavior so that all threads are killed on exit() and (2) moved task data into the main thread's stack. Killing all threads on exit() is the correct one but requires that all threads be killed with knife-edge precision. However, pthread_exit() is inherently a lazy exit and the decision needs to be changed to handle a longer, drawn out exit sequence.

This problem has existed for some time. It will still exist after reverting no1wudi's change. However, the effect before that change was less visible since there was no task data that was accessed on exit.

I think that the code works when the TLS data is in the group structure instead of in the main thread's stack because the main thread's stack does not persist BUT there is referencing counting on the group structure and it will persist until the last thread exits and calls group_leave().

Similar reference counting would be needed on user-space data, whether it is the main thread's stack or is another user-space group structure as yamt proposes.

@xiaoxiang781216
Copy link
Contributor

xiaoxiang781216 commented Jun 9, 2021

TLS slot info was saved in task_group_s which already support the reference counting that's why the error doesn't happen before. But with #3858, TLS slot info is moved to the bottom of the main thread stack without the reference count.

@patacongo
Copy link
Contributor

patacongo commented Jun 9, 2021

TLS slot info was saved in task_group_s which already support the reference counting that's why the error doesn't happen before. But with this patch, TLS slot info is moved to the bottom of the main thread stack without the reference count.

Yes, we agree on this. One solution is as yamt suggested.

@xiaoxiang781216
Copy link
Contributor

It's a good idea to move all TLS info and function from kernel to userspace, @no1wudi please think about the yamt's suggestion.

@yamt
Copy link
Contributor Author

yamt commented Jun 9, 2021

btw, what was the main motivation of moving the info out of the kernel?

@xiaoxiang781216
Copy link
Contributor

To refine the implementation seperation of kernel and userspace, there are many functions which is pure userspace functionality but put into kernel space wrongly.

unixjet pushed a commit to unixjet/incubator-nuttx that referenced this issue Jun 11, 2021
This reverts commit cc514d7.

* It introduced a regression.
  apache#3868

* It seems conceptually wrong to have per-process data in
  the main thread's stack.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants