Assertion failed at file:tls/tls_getset.c line: 59 #3868

yamt · 2021-06-08T07:54:21Z

i occasionally get an assertion failure.

up_assert: Assertion failed at file:tls/tls_getset.c line: 59 task: pt-0x400edfd0

a backtrace:

0x400d5d87: xtensa_assert at xtensa_assert.c:93
0x400d5de3: up_assert at xtensa_assert.c:169 (discriminator 2)
0x400d3f89: _assert at lib_assert.c:36
0x400d5022: tls_get_set at tls_getset.c:59 (discriminator 1)
0x400d4fb4: tls_destruct at tls_destruct.c:63 (discriminator 2)
0x400d40d7: pthread_exit at pthread_exit.c:61
0x400ee088: iperf_report_task at iperf.c:231
0x400e94c1: pthread_startup at pthread_create.c:59 (discriminator 2)
0x400f25ab: pthread_start at pthread_create.c:191 (discriminator 4)

i guess that a detached thread is exiting after the main thread exited.

xiaoxiang781216 · 2021-06-08T10:29:01Z

It's a problem if the background thread is still running after the main thread has exited.

yamt · 2021-06-08T13:11:46Z

@xiaoxiang781216 it's my understanding too. do you have an idea how to fix it?

xiaoxiang781216 · 2021-06-08T13:38:21Z

@patacongo and @no1wudi do you have any ideas for this special case?

patacongo · 2021-06-08T16:29:51Z

It does look like the main thread has exited or that the tg_pid is not valid for some reason. The tg_pid is set when the task group is created and is never changed until the task group is de-allocated. So unless there is a wild write, I would expect the tg_pid to be valid. The more likely thing is that the main thread has exited.

That is really very likely under certain circumstances. pthreads are not really canceled on exit() in deferred cancellation mode: they may only be marked as canceled and will continue to run until the thread calls a cancellation point. That could be some time later. Meanwhile, the main thread could exit, couldn't it?

1. Main thread calls exit()
2. All pthreads are canceled, but one or more persists because they are in deferred cancellation mode
3. pthread exits and attempts to call data destructors
4. That fails and the assertion occurs because the main thread has already exited.

Or if a pthread is waiting on a semaphore and is cancelled, but the OS logic ignores the ECANCELED error, then the pthread will never exit.

In either case, in these states tg_pid will be valid, but nxsched_get_tcb() will fail. Perhaps there is some kind of protection that I do not see, but I think this could happen in normal operation, couldn't it? If so, then one solution would be to remove the assertion.
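The deferred-cancellation behaviour described above can be seen with plain POSIX threads. The following is only a minimal sketch for a POSIX host, not NuttX code; the function and variable names are made up for illustration. The worker keeps running after pthread_cancel() has been called on it and terminates only when it reaches a cancellation point (here pthread_testcancel()).

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static atomic_int worker_started;    /* set by the worker once it is running */
static atomic_int cancel_sent;       /* set by the canceller after pthread_cancel() */
static atomic_int ran_after_cancel;  /* set by the worker after the request was made */
static atomic_int passed_testcancel; /* must stay 0: code after the cancellation point */

static void *worker(void *arg)
{
  (void)arg;
  atomic_store(&worker_started, 1);

  /* Busy-wait without calling any cancellation point. */
  while (!atomic_load(&cancel_sent))
    {
    }

  /* The cancel request is already pending, yet the thread is still running. */
  atomic_store(&ran_after_cancel, 1);

  pthread_testcancel(); /* Cancellation point: the thread terminates here. */

  atomic_store(&passed_testcancel, 1); /* Not reached. */
  return NULL;
}

/* Returns 0 if the worker ran on after being cancelled and terminated
 * only at the cancellation point; a negative value otherwise.
 */
int deferred_cancel_demo(void)
{
  pthread_t tid;
  void *result = NULL;
  int rc;

  atomic_store(&worker_started, 0);
  atomic_store(&cancel_sent, 0);
  atomic_store(&ran_after_cancel, 0);
  atomic_store(&passed_testcancel, 0);

  rc = pthread_create(&tid, NULL, worker, NULL);
  assert(rc == 0);

  while (!atomic_load(&worker_started))
    {
    }

  rc = pthread_cancel(tid); /* Default type is deferred: only marks the thread. */
  assert(rc == 0);
  atomic_store(&cancel_sent, 1);

  rc = pthread_join(tid, &result);
  assert(rc == 0);
  (void)rc;

  if (result != PTHREAD_CANCELED) return -1;
  if (!atomic_load(&ran_after_cancel)) return -2;
  if (atomic_load(&passed_testcancel)) return -3;
  return 0;
}
```

In the scenario above, the window between the cancel request and the next cancellation point is where the main thread could finish exiting first.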

I suspect that there are other possibilities for race conditions. For example, I think this could cause a similar problem:

What happens when a pthread calls exit()? In that case, it looks like the main thread could be killed before that pthread is killed. group_kill_children() will kill the main thread and the pthread will be the last thread to exit (via pthread_exit). Then when that pthread calls its destructors, the main thread would not exist and the assertion would occur.

1. pthread exits
2. All threads killed or cancelled except for the calling pthread
3. Calling pthread exits and attempts to call data destructors
4. Assertion occurs because the main thread was killed in step 2.

I am just speculating about ways that the ordering could change due to race conditions. I don't know if any of the above are truly possible. If any of these race conditions do occur, then it is likely that the persisting pthreads could be operating on stale memory since, most likely, the group structure will have been freed.
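For reference, the "pthread exits and attempts to call data destructors" step is ordinary POSIX behaviour: when a thread exits, the destructor registered with pthread_key_create() is run for each key that holds a non-NULL value. In the backtrace above this is the pthread_exit -> tls_destruct -> tls_get_set path. A minimal host-side sketch (the names below are illustrative only, not NuttX internals):

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

static pthread_key_t demo_key;
static atomic_int destructor_calls;

/* Runs in the context of the exiting thread. */
static void demo_destructor(void *value)
{
  (void)value;
  atomic_fetch_add(&destructor_calls, 1);
}

static void *exiting_thread(void *arg)
{
  /* The value must be non-NULL, otherwise the destructor is skipped. */
  int rc = pthread_setspecific(demo_key, arg);
  assert(rc == 0);
  (void)rc;

  pthread_exit(NULL); /* Destructors for this thread's keys run from here. */
  return NULL;        /* Not reached. */
}

/* Returns how many times the destructor ran for one exiting thread. */
int count_destructor_calls_on_exit(void)
{
  static int dummy_value;
  pthread_t tid;
  int rc;

  atomic_store(&destructor_calls, 0);

  rc = pthread_key_create(&demo_key, demo_destructor);
  assert(rc == 0);

  rc = pthread_create(&tid, NULL, exiting_thread, &dummy_value);
  assert(rc == 0);

  rc = pthread_join(tid, NULL);
  assert(rc == 0);

  rc = pthread_key_delete(demo_key);
  assert(rc == 0);
  (void)rc;

  return atomic_load(&destructor_calls);
}
```

The point for this issue is that the destructor lookup happens at thread exit, so whatever storage holds the destructor table has to outlive every thread that may still exit.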

no1wudi · 2021-06-09T06:32:49Z

@yamt Could you test again without #3858 (i.e. with it reverted)?

Before #3858, the TLS destructors were stored in task_group_s, which is freed only when all threads have exited (by group_leave).

How about keeping the main thread's stack alive until all threads have exited, like task_group_s?

This reverts commit cc514d7. * It introduced a regression. apache#3868 * It seems conceptually wrong to have per-process data in the main thread's stack.

yamt · 2021-06-09T07:01:44Z

> @yamt Could you test again without #3858 (i.e. with it reverted)?
>
> Before #3858, the TLS destructors were stored in task_group_s, which is freed only when all threads have exited (by group_leave).

thank you for the pointer.
i tested and submitted a revert.
#3877

> How about keeping the main thread's stack alive until all threads have exited, like task_group_s?

i suspect it's better to introduce a real per-task-group structure than to keep abusing the main thread.
what do you think?

patacongo · 2021-06-09T13:39:19Z

> i suspect it's better to introduce a real per-task-group structure than to keep abusing the main thread.
> what do you think?

As I mentioned above, I think that the root cause is that the main thread can exit before the last pthread exits in certain race conditions. So I think that the correct solution is to add a reference count so that the main thread data persists until the last thread exits.

This logic used to exist in the past: the task group persisted until the last thread exited, as determined by a reference count. That was lost when we (1) corrected the behavior so that all threads are killed on exit() and (2) moved task data into the main thread's stack. Killing all threads on exit() is the correct behavior, but it requires that all threads be killed with knife-edge precision. However, pthread_exit() is inherently a lazy exit, and the design needs to be changed to handle a longer, drawn-out exit sequence.

This problem has existed for some time. It will still exist after reverting no1wudi's change. However, the effect before that change was less visible since there was no task data that was accessed on exit.

I think that the code works when the TLS data is in the group structure instead of in the main thread's stack because the main thread's stack does not persist, BUT there is reference counting on the group structure and it will persist until the last thread exits and calls group_leave().

Similar reference counting would be needed on user-space data, whether it is the main thread's stack or is another user-space group structure as yamt proposes.
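To illustrate the kind of reference counting meant here, a rough sketch follows. All names (struct group_info, group_info_create(), group_info_ref(), group_info_unref(), live_group_count()) are hypothetical and are not existing NuttX APIs; the idea is only that every thread in the group holds one reference and the shared data is freed by whichever thread drops the last one, which need not be the main thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical per-task-group record; not a NuttX structure. */
struct group_info
{
  atomic_int refs;                /* one reference per live thread */
  void (*destructor[4])(void *);  /* e.g. the pthread key destructor table */
};

static atomic_int g_live_groups;  /* bookkeeping for the example only */

/* Called when the main thread is created; it holds the first reference. */
struct group_info *group_info_create(void)
{
  struct group_info *g = calloc(1, sizeof(*g));
  if (g != NULL)
    {
      atomic_init(&g->refs, 1);
      atomic_fetch_add(&g_live_groups, 1);
    }

  return g;
}

/* Called when a pthread joins the group. */
void group_info_ref(struct group_info *g)
{
  assert(atomic_load(&g->refs) > 0);
  atomic_fetch_add(&g->refs, 1);
}

/* Called when any thread leaves the group, after its destructors ran.
 * Returns 1 if this call released the last reference and freed the data.
 */
int group_info_unref(struct group_info *g)
{
  if (atomic_fetch_sub(&g->refs, 1) == 1)
    {
      free(g);
      atomic_fetch_sub(&g_live_groups, 1);
      return 1;
    }

  return 0;
}

int live_group_count(void)
{
  return atomic_load(&g_live_groups);
}
```

With this shape, the main thread exiting first just drops one reference; a lingering pthread can still read the destructor table during its own exit and frees the record when it leaves.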

xiaoxiang781216 · 2021-06-09T14:02:33Z

TLS slot info was saved in task_group_s, which already supports reference counting; that's why the error didn't happen before. But with #3858, the TLS slot info is moved to the bottom of the main thread's stack, without a reference count.

patacongo · 2021-06-09T14:03:29Z

> TLS slot info was saved in task_group_s, which already supports reference counting; that's why the error didn't happen before. But with this patch, the TLS slot info is moved to the bottom of the main thread's stack, without a reference count.

Yes, we agree on this. One solution is as yamt suggested.

xiaoxiang781216 · 2021-06-09T14:14:05Z

It's a good idea to move all TLS info and functions from the kernel to userspace. @no1wudi please think about yamt's suggestion.

yamt · 2021-06-09T23:08:01Z

btw, what was the main motivation of moving the info out of the kernel?

xiaoxiang781216 · 2021-06-10T02:38:12Z

To refine the separation between the kernel and userspace implementations: there are many functions that are purely userspace functionality but were wrongly put into kernel space.

yamt mentioned this issue Jun 8, 2021

arch: Allocate the space from the beginning in up_stack_frame #3517

Merged

yamt mentioned this issue Jun 9, 2021

Revert "tls: Move pthread key destructor to libc" #3877

Merged

xiaoxiang781216 pushed a commit that referenced this issue Jun 9, 2021

Revert "tls: Move pthread key destructor to libc"

b3e8535

This reverts commit cc514d7. * It introduced a regression. #3868 * It seems conceptually wrong to have per-process data in the main thread's stack.

xiaoxiang781216 linked a pull request Jun 9, 2021 that will close this issue

Revert "tls: Move pthread key destructor to libc" #3877

Merged

xiaoxiang781216 closed this as completed Jun 9, 2021

no1wudi mentioned this issue Jun 29, 2021

libc: Move atexit/on_exit/pthread_key_destructor to user space #3951

Closed
