
Galera feature: retry applying of write sets at slave nodes #387

Open · wants to merge 5 commits into base: 11.4

Conversation

@plampio commented on Jan 9, 2024

Description

Initial version of a feature that allows retrying the application of write sets on slave nodes. If Galera Cluster fails to apply a write set on a slave node, replication fails and the node is dropped from the cluster. However, the apply failure might be only temporary; a second apply attempt might succeed. By retrying the apply on slave nodes, Galera Cluster can therefore sometimes recover from such a failure.
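Conceptually, the retry logic looks roughly like the standalone sketch below (all names here are illustrative, including apply_with_retry, apply_write_set and rollback_to_savepoint; this is an outline of the idea, not the patch code):

#include <functional>

// A minimal sketch, assuming apply_write_set() returns 0 on success and
// rollback_to_savepoint() undoes a failed apply attempt.
static int apply_with_retry(const std::function<int()> &apply_write_set,
                            const std::function<void()> &rollback_to_savepoint,
                            unsigned retry_count)  // e.g. wsrep_applier_retry_count
{
  int err = apply_write_set();                     // first attempt
  for (unsigned attempt = 0; err != 0 && attempt < retry_count; attempt++)
  {
    rollback_to_savepoint();                       // undo the failed attempt
    err = apply_write_set();                       // retry
  }
  return err;  // still nonzero: the node is dropped from the cluster as before
}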

This patch is kept quite separate from the rest of the server code and should not introduce any side effects in other parts of the server.

How can this PR be tested?

This patch introduces two new MTR tests for the feature: galera.galera_retry_applying and galera.galera_retry_applying2. In the first test a retry succeeds; in the second, all retry attempts fail.

The new MTR tests need some more work: the cleanup after the tests is not complete.

@plampio requested review from sjaakola and temeo on Jan 9, 2024, 09:52
--connection node_2
SET GLOBAL wsrep_applier_retry_count = 2;
SET GLOBAL debug_dbug = "+d,innodb_insert_fail:o,/dev/null";
CALL mtr.add_suppression("WSREP: Event 3 Write_rows_v1 apply failed: 146, seqno 4");

Better not to include sequence numbers in suppressions; the seqno will not remain the same if the test is run repeatedly or as part of a full MTR run.
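For example, the seqno could be wildcarded or left out; mtr.add_suppression takes a regular expression pattern, so something like the following would match regardless of the seqno (the exact wording is only a suggestion):

CALL mtr.add_suppression("WSREP: Event 3 Write_rows_v1 apply failed: 146, seqno [0-9]+");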

@plampio (Author) replied:

Ok, I will fix this as suggested.

@plampio (Author) replied:

Fixed as suggested.

@janlindstrom left a comment:

LGTM, but is the default of 0 a good choice? What are the dangers of setting it larger, e.g. to 5?

If the variable "wsrep_retry_applying" is nonzero, then a temporary
savepoint "wsrep-retry" is created for each apply of a writeset.
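In plain SQL terms, the mechanism corresponds roughly to the following (an illustration only, not the internal code path):

SAVEPOINT `wsrep-retry`;              -- taken before each apply of a write set
-- ... apply the write set events ...
ROLLBACK TO SAVEPOINT `wsrep-retry`;  -- on an apply failure, undo the attempt before retrying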
@@ -0,0 +1,40 @@
#


The common practice in MTR testing is to bundle several test cases in the same test file; this should yield faster testing. As galera_retry_applying and galera_retry_applying2 are quite similar test cases, please accommodate both in galera_retry_applying.test and remove galera_retry_applying2.test and its result file as well.

@plampio (Author) replied:

Fixed as suggested.

SET GLOBAL debug_dbug = "+d,innodb_insert_fail:o,/dev/null";

--connection node_1
SLEEP 1;


There should be no reason for a sleep here; please remove it.
It is possible to add a wait_condition on connection node_2 to wait until table t2 has been replicated there, but since we have wsrep_sync_wait=15 by default, there should be no absolute need to wait for the replication.
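If an explicit wait is still wanted, a wait_condition along these lines would do it (a sketch only; the table name t2 comes from the test, and the exact condition is an assumption):

--connection node_2
--let $wait_condition = SELECT COUNT(*) = 1 FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 't2'
--source include/wait_condition.inc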

@plampio (Author) replied:

Fixed as suggested.

} else if (0 == strcmp("test/t2", node->table->name.m_name)) {
  err = DB_LOCK_WAIT_TIMEOUT;
  goto error_handling;
}});

err = row_ins(node, thr);


Using hard-coded table names here for the failure injection works, but it is not elegant.
We have recently created a similar failure injection for FK failure cases, where the injection fails a given number of times and execution succeeds after the last failure. Please take a look at commit ea08242 and the DBUG_EXECUTE_IF implementation in storage/innobase/row/row0ins.cc; the same approach should work here.
The FK failure injection also has a hard-coded part that fails exactly 4 times; I assume this could be parametrized so that the MTR script can decide how many failures are wanted (if we want to take this to the next level).
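A counter-based injection could look roughly like the sketch below (the debug keyword wsrep_retry_apply_fail and the hard-coded count of 4 are assumptions for illustration, not the code from commit ea08242):

DBUG_EXECUTE_IF("wsrep_retry_apply_fail",
  {
    /* Fail the first few insert attempts, then let execution succeed. */
    static unsigned wsrep_retry_apply_fail_count = 0;
    if (wsrep_retry_apply_fail_count++ < 4) {
      err = DB_LOCK_WAIT_TIMEOUT;
      goto error_handling;
    }
  });

err = row_ins(node, thr);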

Merged "galera.galera_retry_applying2" into "galera.galera_retry_applying"
and removed the MTR test "galera.galera_retry_applying2".