-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MDEV-31151: Fix a stack overflow in pinbox allocator #2541
base: 10.5
Are you sure you want to change the base?
Conversation
|
@LinuxJedi @grooverdan This fixes a rather serious issue. There is no testable functional or performance regression, and a similar fix has been in MySQL already for a long time. Could you please review it? Thanks! |
Could you elaborate on why the stack size calculation is inaccurate? In particular, why |
@vuvova Thank you for checking the PR.
Sure. Let me know if I missed anything but from my understanding, as i pointed in the PR description,
The address of The corresponding code is https://github.com/MariaDB/server/blob/mariadb-10.6.11/sql/sql_connect.cc#L1390-L1398. The Here's related code to allocate the stack size during new connection:
You can simply verify this inaccurate value with following steps:
|
smap info about the stack of
|
MTR on builder amd64-fedora-36 failed on:
Seems unrelated to the change in this PR. |
I backported this to 10.11 and tested running the full MTR (including --big-test) on amd64/arm64/ppc64el/s390x, and I also included this in a package for which I ran extensive install/upgrade testing on Debian. I did not come across any issues introduced by this change, so all this testing is in favor of merging this. |
We've both got not infrequent commitments beyond PRs so response time can be vary and with the current volume its easy to miss. For the most serious regressions or crashing issues can you use a JIRA Issue with a critical/blocker priority and an assigned Fix versions. Thanks for the clear explanation @HugoWenTD. Leaving it with @vuvova for now since he's looked at stack issues far more than me, but I will catch up sometime. |
This was cherry-picked to the official Debian/Ubuntu package of MariaDB 10.11.2 in https://salsa.debian.org/mariadb-team/mariadb-server/-/commit/d1ea0bda86b99a499d91b40496fefac2a250695c |
Created Jira https://jira.mariadb.org/browse/MDEV-31151 for better tracking this issue as suggested by @grooverdan . |
My interpretation of https://stackoverflow.com/questions/53827/checking-available-stack-size-in-c agrees with the Description. Related to this, in d0abbdf I replaced |
I can think of a more reliable way of implementing a more accurate stack size check. The idea would be to assign the stack pointer to a thread-local variable (not #include <stdio.h>
inline size_t read_sp()
{
size_t sp;
__asm__("mov %%rsp,%0" : "=r" (sp));
return sp;
}
static thread_local size_t stack_start;
#ifdef __clang__
__attribute__((optnone))
#elif defined __GNUC__
__attribute__((optimize(0)))
#endif
static void f(int i)
{
if (stack_start - read_sp() < 1000)
f(i + 1);
else
printf("%d\n", i);
}
int main ()
{
stack_start = read_sp();
f(0);
} Optimizations on the recursive function had to be disabled to avoid an infinite loop or an immediate termination. The recursion depth would reach 20 or 30 when compiled with clang or gcc. An additional problem with Possibly this idea could actually be implemented in a rather portable way: thread_local char *stack_start;
pthread_handler_t thread_handler(void *arg)
{
char dummy;
stack_start = &dummy;
// the rest of the thread routine
} Problems around stack frame alignment would still remain, and we’d have to reliably calculate the stack pointer right before the If it can be demonstrated that there is no performance degradation, merging this pull would seem to be the cleanest solution. |
@dr-m Thank you for reviewing this pull request.
Although comprehensive performance verification for non-regression is challenging, as mentioned in the PR description, I had run the mini-benchmark script to validate the change. Additionally, the commit of this pull request has been merged into AWS RDS MariaDB 10.6+ branches since March 2023. We have not encountered the same crash after applying this fix, and our RDS benchmarking system has not detected any regressions. |
MariaDB supports a "wait-free concurrent allocator based on pinning addresses". In `lf_pinbox_real_free()` it tries to sort the pinned addresses for better performance to use binary search during "real free". `alloca()` was used to allocate stack memory and copy addresses. To prevent a stack overflow when allocating the stack memory the function checks if there's enough stack space. However, the available stack size was calculated inaccurately which eventually caused database crash due to stack overflow. The crash was seen on MariaDB 10.6.11 but the same code defect exists on all MariaDB versions. A similar issue happened previously and the fix in fc2c1e4 was to add a `ALLOCA_SAFETY_MARGIN` which is 8192 bytes. However, that safety margin is not enough during high connection workloads. MySQL also had a similar issue and the fix mysql/mysql-server@b086fda was to remove the use of `alloca` and replace qsort approach by a linear scan through all pointers (pins) owned by each thread. This commit is mostly the same as it is the only way to solve this issue as: 1. Frame sizes in different architecture can be different. 2. Number of active (non-null) pinned addresses varies, so the frame size for the recursive sorting function `msort_with_tmp` is also hard to predict. 3. Allocating big memory blocks in stack doesn't seem to be a very good practice. For further details see the mentioned commit in MySQL and the inline comments. All new code of the whole pull request, including one or several files that are either new files or modified ones, are contributed under the BSD-new license. I am contributing on behalf of my employer Amazon Web Services, Inc.
Rebased to 10.5. |
Description
MariaDB supports a "wait-free concurrent allocator based on pinning addresses".
In
lf_pinbox_real_free()
it tries to sort the pinned addresses for betterperformance to use binary search during "real free".
alloca()
was used toallocate stack memory and copy addresses.
To prevent a stack overflow when allocating the stack memory the function checks
if there's enough stack space. However, the available stack size was calculated
inaccurately which eventually caused database crash due to stack overflow.
The crash was seen on MariaDB 10.6.11 but the same code defect exists on all
MariaDB versions.
Crash stack trace example:
Why was the available stack size calculation inaccurate?
MariaDB defines the
thd->thread_stack
to be used as a stack threadpointer. However, that's not the accurate value of the stack starting
address instead it's the address of the
thd
object.thd->thread_stack= (char*) &thd;
That inaccurate value is then used to calculate the variable of
stack_ends_here
which is eventually used in functionlf_pinbox_real_free()
to get the available stack size.The end result is inaccurate and could cause stack overflow in the
qsort() function, especially when there are high connection numbers.
e.g. 8000-12000 concurrent connections with MariaDB system variable
thread_stack
=262144.https://mariadb.com/kb/en/server-system-variables/#thread_stack
Additionally it's easier to be reproduced on ARM instances, it's
likely related to there are more registers to save/restore on ARM64
and the structure alignment on ARM tends to have stronger alignment
requirements.
Those differences cause slightly different stack frame sizes, and with
the higher performance of AWS r6g.8xlarge instances, it may cause
higher number of pinned addresses in the pinbox.
A similar issue happened previously and the fix in fc2c1e4 was to add a
ALLOCA_SAFETY_MARGIN
which is 8192 bytes. However, that safety margin is notenough during high connection workloads.
MySQL also had a similar issue and the fix
mysql/mysql-server@b086fda was to remove the use of
alloca
and replace qsort approach by a linear scan through all pointers (pins)owned by each thread.
This commit is mostly the same as it is the only way to solve this issue as:
size for the recursive sorting function
msort_with_tmp
is also hardto predict.
Note the number of active pinned addresses usually is very small,
with 8000 active connections on an AWS r6g.8xlarge instance it's
about less than 10. But we don't know if there's edge case that the
value could be bigger.
practice.
For further details see the mentioned commit in MySQL and the inline comments.
In addition,
thd->thread_stack
andstack_ends_here
should not beused to calculate if we need an accurate value related to the stack
size. However
stack_ends_here
is already used in many places so Iwould not deprecate or change it in this commit.
Mini-benchmarking test had been done with script
https://github.com/MariaDB/server/blob/10.11/support-files/mini-benchmark.sh
and no performance regression was seen.
How can this PR be tested?
MTR and mini-benchmark tests had been done to verify there's no regression.
Basing the PR against the correct MariaDB version
Copyright
All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer Amazon Web
Services, Inc.