
Ensure replication jobs migrate when the shard map changes #5093

Merged
merged 1 commit into from
Jun 21, 2024

Conversation

nickva (Contributor) commented Jun 18, 2024

Previously, we didn't account for all the possible migration patterns; specifically, the case where a replication job should migrate from node A to node C when a shard moves from A to B. The job would stop on node A; on node B the replicator would notice it was not the owner and ignore the change. Node C, however, would never see any shard file changes, so it would never know it should start the replication job.

To fix the migration logic, we take a naive approach: make the replication doc processor retain replication records not just for the jobs it owns, but also for the shard copies for which it could be a potential owner. Then, periodically, check ownership and start or stop jobs accordingly.

The difference from before is a new replicator doc processor state: not_owner. Docs in this state still track all the doc updates but otherwise stay inactive: they don't run jobs and they don't update replication documents on errors. Only when the node becomes the owner does the record switch to the initializing state and the job start running.

To avoid unnecessary duplicate-job dueling while other nodes haven't yet run their membership checks, usurped jobs are stopped immediately during a membership check, but jobs that should run are started with a delay. This means a migrating job may stop running for a minute or so and then start running on another node.
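As an illustration only, the periodic check described above might look roughly like the following Python sketch. The real implementation is Erlang inside the doc processor; `Record`, `check_membership`, the state handling, and `START_DELAY_SEC` here are simplified hypothetical stand-ins, not the actual code.

```python
import time

START_DELAY_SEC = 60  # illustrative; stands in for the real startup delay

class Record:
    """Simplified stand-in for a replication record in the doc processor."""
    def __init__(self, db, doc_id):
        self.db = db
        self.doc_id = doc_id
        self.state = "not_owner"  # not_owner | initializing | running
        self.start_at = None      # when a delayed start is scheduled

def check_membership(records, this_node, compute_owner):
    """Stop usurped jobs immediately; start newly owned jobs only after
    a delay so other nodes get a chance to run their own checks first."""
    now = time.monotonic()
    for rec in records:
        is_owner = compute_owner(rec.db, rec.doc_id) == this_node
        if rec.state != "not_owner" and not is_owner:
            # Usurped by another node: stop right away to avoid dueling.
            rec.state = "not_owner"
            rec.start_at = None
        elif rec.state == "not_owner" and is_owner:
            # Became the owner: schedule a delayed start.
            rec.state = "initializing"
            rec.start_at = now + START_DELAY_SEC
```

The asymmetry (immediate stop, delayed start) is what keeps two nodes from briefly running the same job while their checks are out of phase.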

Additionally, we can remove the couch_replicator_db_changes module, which used to monitor cluster stability and restart the scanner on node connects and disconnects (membership checks now happen in the doc processor). The only thing left from couch_replicator_db_changes was the couch_multidb subscription, which was moved to the scanner function in the doc_processor.

Since we removed couch_replicator_db_changes, we also don't need cluster stability monitoring, so that module is removed as well. The owner function is moved to the utils module and its unstable return value is eliminated. The membership check in the doc processor stops jobs immediately but delays their startup, so we get the same effect as before with the stability check.

Previously, the couch_replicator_clustering module also applied an initial startup delay to let the application start properly before scanning for jobs; to retain compatibility, that setting was moved to the doc processor.

The new behavior should be covered by tests in the doc processor. Since the tests were updated anyway, we took the opportunity to switch the doc processor test suites to the TDEF_FE macro, which removes a few redundant setup lines (begin/end ...).

@nickva nickva force-pushed the fix-replicator-job-migration-on-rebalancing branch from 3f4124a to db76759 Compare June 18, 2024 22:52
@nickva nickva force-pushed the fix-replicator-job-migration-on-rebalancing branch from db76759 to 1df2636 Compare June 19, 2024 15:14
@@ -108,11 +112,6 @@ get_worker_ref({DbName, DocId}) when is_binary(DbName), is_binary(DocId) ->
nil
end.

% Cluster membership change notification callback
nickva (Contributor, Author) commented:
We don't need the cluster membership change notification any longer as the check_membership periodic check would take care of it. couch_replicator_db_changes will still initiate a rescan when cluster nodes connect/disconnect.

A contributor replied:

That's a nice cleanup that removes most handle_cast clauses.

nickva commented Jun 20, 2024

Playing with the new PR locally, I noticed that the couch_replicator_clustering module and the separate db monitoring module were no longer needed, so I removed them. I added a separate [fixup] commit to explain why they were not needed any longer.

nickva commented Jun 20, 2024

To test job migration even when the local shard doesn't get updated, I used an N=5 db and then stopped one of the nodes where the job wasn't running. I then saw the job correctly migrate to one of the remaining nodes.

I used this silly script to create the replication job:

#!/bin/bash
# make && ./dev/run --admin=adm:pass -n 5 --with-haproxy

S=localhost:5984
SA=adm:pass@${S}
JSON="Content-Type: application/json"

curl -s -XDELETE $SA/t > /dev/null
curl -s -XDELETE $SA/s > /dev/null
curl -s -XDELETE $SA/_replicator > /dev/null

sleep 1
echo "Creating _replicator db"
curl -f -s -XPUT $SA/_replicator'?n=5' > /dev/null

sleep 1
echo "Creating src and tgt dbs"
curl -f -s -XPUT $SA/t > /dev/null
curl -f -s -XPUT $SA/s > /dev/null

echo "Posting replication job"
curl -f -XPUT $SA/_replicator/r -H "$JSON" \
 -d '{"source":"http://adm:pass@localhost:5984/s", "target":"http://adm:pass@localhost:5984/t", "worker_processes":1, "continuous":true}'

Cluster started with:

make && ./dev/run --admin=adm:pass -n 5 --with-haproxy

As a result, my replication job ended up running on NODE4. I then stopped NODE2, which changed the hashing used in the ownership algorithm, and the new owner became NODE1:

NODE4
Replication doc shards/00000000-7fffffff/_replicator.1718859537:r with id {"5fe95cdd2e5718789c106b986a2d8340","+continuous"} usurped by node '[email protected]'
couch_replicator_scheduler: Job {"5fe95cdd2e5718789c106b986a2d8340","+continuous"} stopped as <0.19718.0>

NODE2
([email protected])1> init:stop().
([email protected])2> *** ERROR: Shell process terminated! (^G to start new job) ***

NODE1
Node '[email protected]' is the usurper for doc shards/00000000-7fffffff/_replicator.1718859537:r with id nil
couch_replicator_scheduler: Job {"5fe95cdd2e5718789c106b986a2d8340","+continuous"} started as <0.21919.0>
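The ownership change in the logs above can be reproduced with a toy model. Assuming, for illustration only, that the owner is chosen by hashing the replication doc over the sorted list of live nodes (CouchDB's actual owner function differs in detail; `toy_owner` below is entirely hypothetical), removing any node can remap a doc to a new owner even when the removed node wasn't running the job, which is exactly the A-to-C migration pattern this PR fixes:

```python
import hashlib

def toy_owner(live_nodes, db_name, doc_id):
    # Hypothetical: deterministically pick an owner by hashing the
    # (db, doc) pair over the sorted list of currently live nodes.
    # CouchDB's real owner function differs; this only shows the shape.
    nodes = sorted(live_nodes)
    h = int(hashlib.md5(f"{db_name}/{doc_id}".encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["node1", "node2", "node3", "node4", "node5"]
before = toy_owner(nodes, "_replicator", "r")
after = toy_owner([n for n in nodes if n != "node2"], "_replicator", "r")
# `before` and `after` may differ even if "node2" was not the owner,
# because the modulus over the node list changed for every doc.
```

Any node in the cluster can therefore become the new owner after a membership change, which is why every node must retain records for shards it could potentially own.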

jaydoane (Contributor) left a review:
Way to fix a nasty corner case, and also reduce the amount of code, and complexity at the same time. Brilliant work!

@nickva nickva force-pushed the fix-replicator-job-migration-on-rebalancing branch from 2f68d18 to 9c701a6 Compare June 21, 2024 15:47
@nickva nickva merged commit a563d05 into main Jun 21, 2024
17 checks passed
@nickva nickva deleted the fix-replicator-job-migration-on-rebalancing branch June 21, 2024 16:28