-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Assertion failed at file:tls/tls_getset.c line: 59 #3868
Comments
It's a problem if the background thread is still running but the main thread exit. |
@xiaoxiang781216 it's my understanding too. do you have an idea how to fix it? |
@patacongo and @no1wudi do you have any idea for this especial case? |
it does look like the main thread has exited or that the tg_pid is not valid for some reason. The tg_pid is set when the task group is create and is never changed until the task group is de-allocated. So until there is a wild write, I would expect the tg_pid to be valid. The more likely thing is that the main thread has exited. That is really very likely under circumstances. pthreads are not really canceled on exit() in deferred cancellation mode: They may only be marked as deleted and will continue to run until the thread calls a cancellation point. That could be some time later. Meanwhile, the main thread could exit couldn't it.
Or if a pthread is waiting on a semaphore and cancelled, but the OS logic ignores the ECANCELED error, then the pthread will never exit will never exit. In either case, in these states tg_pid will be valid, but nxsched_get_tcb() will fail. Perhaps there is some kind of protection that I do not see, but I think this could happen in normal operation, couldn't it? If so, then one solution would be to remove the assertion. I suspect that there are other possibilities for race conditions. So example, I think this could cause a similar problem: What happens when a pthread calls exit()? In that case, it looks like the main thread could be killed before that pthread is killed. group_kill_children() will kill the main thread and the pthread will be the last thread to exit (via pthread_exit). Then when that pthread calls its destructors, the main thread would not exist the assertion would occur.
I am just speculating about ways that the ordering could change to to race conditions. I don't know if any of the above are truly possible. If any of these race conditions do occur than it is likely that the persisting pthreads could be operating on stale memory since, most likely, the group structure will have been freed. |
This reverts commit cc514d7. * It introduced a regression. apache#3868 * It seems conceptually wrong to have per-process data in the main thread's stack.
This reverts commit cc514d7. * It introduced a regression. apache#3868 * It seems conceptually wrong to have per-process data in the main thread's stack.
thank you for the pointer.
i suspect it's better to introduce a real per-task-group structure than keeping abusing main thread. |
As I mention above, I think that the root case is the main can exit before the last pthread exits in certain race conditions. So I think that that correct solution is to add a reference count so that the main thread data persists until the last thread exits. This logic used to exist in the past. The task group persists until the last thread exited as determined by a reference count. That was lost when we (1) corrected the behavior so that all threads are killed on exit() and (2) moved task data into the main thread's stack. Killing all threads on exit() is the correct one but requires that all threads be killed with knife-edge precision. However, pthread_exit() is inherently a lazy exit and the decision needs to be changed to handle a longer, drawn out exit sequence. This problem has existed for some time. It will still exist after reverting no1wudi's change. However, the effect before that change was less visible since there was no task data that was accessed on exit. I think that the code works when the TLS data is in the group structure instead of in the main thread's stack because the main thread's stack does not persist BUT there is referencing counting on the group structure and it will persist until the last thread exits and calls group_leave(). Similar reference counting would be needed on user-space data, whether it is the main thread's stack or is another user-space group structure as yamt proposes. |
TLS slot info was saved in task_group_s which already support the reference counting that's why the error doesn't happen before. But with #3858, TLS slot info is moved to the bottom of the main thread stack without the reference count. |
Yes, we agree on this. One solution is as yamt suggested. |
It's a good idea to move all TLS info and function from kernel to userspace, @no1wudi please think about the yamt's suggestion. |
btw, what was the main motivation of moving the info out of the kernel? |
To refine the implementation seperation of kernel and userspace, there are many functions which is pure userspace functionality but put into kernel space wrongly. |
This reverts commit cc514d7. * It introduced a regression. apache#3868 * It seems conceptually wrong to have per-process data in the main thread's stack.
i occasionally get an assertion failure.
a backtrace:
i guess that a detached thread is exiting after the main thread exited.
The text was updated successfully, but these errors were encountered: