10.11 innodb wsrep applier lock wait timeout #2351

sjaakola · 2022-11-29T17:39:10Z

The Jira issue number for this PR is: MDEV-29684

Description

This patch implements wsrep applier lock wait timeout
functionality.

Transactions executed by appliers have already passed
certification in the cluster, so they must be applied and committed
successfully. However, BF aborting of local transactions
occasionally fails because of race conditions or unforeseen
behavior of the lock manager, which may cause appliers to wait
for locks indefinitely. In particular, a local transaction that
has already reached the commit stage will not yield via lock
wait timeout.

To resolve indefinite applier waits, a short applier
lock wait timeout is introduced. However, instead of giving up
the lock wait, a background thread is used to retry the BF abort
on behalf of the applier that is waiting for the lock.

The applier lock wait timeout is controlled by the variable
innodb_wsrep_applier_lock_wait_timeout, which has a default
value of five seconds. If the value is zero, the background
BF aborting is disabled.
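
The retry-on-timeout behavior described above can be illustrated with a small, deterministic C++ model. This is only a sketch under stated assumptions, not the code in this PR: `ApplierWait`, `watchdog_tick` and the `retry_bf_abort` callback are hypothetical names, the one-second tick is a modeling choice, and repeating the retry every time the timeout elapses again is an assumption about the intended behavior.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of one applier waiting for a lock.
// This does not mirror InnoDB's real lock wait structures.
struct ApplierWait {
    std::uint64_t waited_seconds = 0;  // time spent waiting so far
    unsigned bf_abort_retries = 0;     // BF abort retries issued for this wait
    bool lock_granted = false;         // true once the blocker was aborted
};

// Models one tick (one second) of a background watchdog.
// timeout_seconds stands in for innodb_wsrep_applier_lock_wait_timeout;
// zero disables background BF aborting, as the description states.
// retry_bf_abort is a hypothetical callback that tries to abort the
// blocking local transaction and returns true on success.
template <typename RetryFn>
void watchdog_tick(ApplierWait& w, std::uint64_t timeout_seconds,
                   RetryFn retry_bf_abort)
{
    if (w.lock_granted)
        return;                        // nothing left to do for this wait
    ++w.waited_seconds;
    if (timeout_seconds == 0)
        return;                        // feature disabled: applier just waits
    if (w.waited_seconds % timeout_seconds != 0)
        return;                        // timeout has not elapsed (again) yet
    // Assumption: the applier keeps waiting, and the retry is repeated
    // each time another timeout period passes without the lock.
    ++w.bf_abort_retries;
    if (retry_bf_abort())
        w.lock_granted = true;
}

// Usage example: with a 5 second timeout, the first retry (at 5 s) fails
// and the second retry (at 10 s) succeeds.
inline void example()
{
    ApplierWait w;
    int calls = 0;
    auto retry = [&calls] { return ++calls == 2; };
    for (int second = 0; second < 10; ++second)
        watchdog_tick(w, 5, retry);
    assert(w.bf_abort_retries == 2);
    assert(w.lock_granted);
}
```

The point of the model is that the applier never abandons its lock wait: the timeout only triggers another BF abort attempt, and a value of zero leaves the wait untouched.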

The value of innodb_wsrep_applier_lock_wait_timeout is set
to zero in the Galera suite MTR test configuration to avoid
non-deterministic behavior.

How can this PR be tested?

The PR contains an MTR test for the functionality.

Basing the PR against the correct MariaDB version

[x] This is a new feature and the PR is based against the latest MariaDB development branch
[ ] This is a bug fix and the PR is based against the earliest branch in which the bug can be reproduced

Backward compatibility

This patch implements wsrep applier lock wait timeout functionality.

Transactions executed by appliers have already passed certification in the cluster, so they must be applied and committed successfully. However, BF aborting of local transactions occasionally fails because of race conditions or unforeseen behavior of the lock manager, which may cause appliers to wait for locks indefinitely. In particular, a local transaction that has already reached the commit stage will not yield via lock wait timeout.

To resolve indefinite applier waits, a short applier lock wait timeout is introduced. However, instead of giving up the lock wait, a background thread is used to retry the BF abort on behalf of the applier that is waiting for the lock.

The applier lock wait timeout is controlled by the variable innodb_wsrep_applier_lock_wait_timeout, which has a default value of five seconds. If the value is zero, the background BF aborting is disabled. A separate timer, wsrep_BF_watchdog_timer, was added to achieve one-second resolution.

The value of innodb_wsrep_applier_lock_wait_timeout is set to zero in the Galera suite MTR test configuration to avoid non-deterministic behavior.

CLAassistant · 2022-11-29T17:39:18Z

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ sjaakola
❌ temeo

dr-m

Please fix the build and test failures.

Because I am concerned about potential performance regressions, I think that this will require extensive performance testing for the case that Galera replication is disabled.

I am confused about the testing status. MDEV-29684 mentions version 10.4, yet this change is targeting 10.11.

dr-m · 2022-12-05T07:10:21Z