
Ensure replication jobs migrate when the shard map changes #5093

Merged
merged 1 commit into from
Jun 21, 2024

Conversation

nickva (Contributor) commented Jun 18, 2024

Previously, we didn't account for all the possible migration patterns; specifically, the case where a replication job should migrate from node A to node C when a shard moves from A to B. The job would stop on node A; on node B the replicator would notice it was not the owner and ignore the change. Node C, however, would never see any shard file changes, so it would never know it should start the replication job.

To fix the migration logic, we take a naive approach: make the replication doc processor retain replication records not just for the jobs it owns, but also for the shard copies for which it could be a potential owner. Then, periodically, check ownership and start or stop jobs accordingly.

The difference from before is a new replicator doc processor state: not_owner. Docs in this state still track all the doc updates but otherwise stay inactive: they don't run jobs and they don't update replication documents on errors. Only when the node becomes the owner does the record switch to the initializing state and the job start running.

To avoid unnecessary duplicate-job dueling while other nodes haven't yet run their membership checks, usurped jobs are stopped immediately during a membership check, but jobs that should run are started with a delay. This means a migrating job may stop running for a minute or so and then start running on another node.
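As an illustration only, the periodic check described above might look roughly like the following Python sketch. The real implementation is Erlang inside the doc processor; `Record`, `check_membership`, the state handling, and `START_DELAY_SEC` here are simplified hypothetical stand-ins, not the actual code.

```python
import time

START_DELAY_SEC = 60  # illustrative; stands in for the real startup delay

class Record:
    """Simplified stand-in for a replication record in the doc processor."""
    def __init__(self, db, doc_id):
        self.db = db
        self.doc_id = doc_id
        self.state = "not_owner"  # not_owner | initializing | running
        self.start_at = None      # when a delayed start is scheduled

def check_membership(records, this_node, compute_owner):
    """Stop usurped jobs immediately; start newly owned jobs only after
    a delay so other nodes get a chance to run their own checks first."""
    now = time.monotonic()
    for rec in records:
        is_owner = compute_owner(rec.db, rec.doc_id) == this_node
        if rec.state != "not_owner" and not is_owner:
            # Usurped by another node: stop right away to avoid dueling.
            rec.state = "not_owner"
            rec.start_at = None
        elif rec.state == "not_owner" and is_owner:
            # Became the owner: schedule a delayed start.
            rec.state = "initializing"
            rec.start_at = now + START_DELAY_SEC
```

The asymmetry (immediate stop, delayed start) is what keeps two nodes from briefly running the same job while their checks are out of phase.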

Additionally, we can remove the couch_replicator_db_changes module, which used to monitor cluster stability and restart the scanner on node connects and disconnects (membership checks now happen in the doc processor). The only thing left from couch_replicator_db_changes was the couch_multidb subscription, which was moved to the scanner function in the doc_processor.

Since we removed couch_replicator_db_changes, we also don't need cluster stability monitoring, so that module is removed as well. The owner function is moved to the utils module and its unstable return value is eliminated. The membership check in the doc processor stops jobs immediately but delays their startup, so we get the same effect as before with the stability check.

Previously, the couch_replicator_clustering module also applied an initial startup delay to let the application start properly before scanning for jobs; to retain compatibility, that setting was moved to the doc processor.

The new behavior should be covered by tests in the doc processor. Since the tests were updated anyway, we took the opportunity to switch the doc processor test suites to the TDEF_FE macro, which removes a few redundant setup lines (begin/end ...).

@nickva nickva force-pushed the fix-replicator-job-migration-on-rebalancing branch from 3f4124a to db76759 Compare June 18, 2024 22:52
@nickva nickva force-pushed the fix-replicator-job-migration-on-rebalancing branch from db76759 to 1df2636 Compare June 19, 2024 15:14
@@ -108,11 +112,6 @@ get_worker_ref({DbName, DocId}) when is_binary(DbName), is_binary(DocId) ->
nil
end.

% Cluster membership change notification callback
nickva (Contributor, Author) commented:
We don't need the cluster membership change notification any longer as the check_membership periodic check would take care of it. couch_replicator_db_changes will still initiate a rescan when cluster nodes connect/disconnect.

A contributor replied:

That's a nice cleanup that removes most handle_cast clauses.

nickva commented Jun 20, 2024

Playing with the new PR locally, I noticed that the couch_replicator_clustering module and the separate db monitoring module were no longer needed, so I removed them. I added a separate [fixup] commit to explain why they were not needed any longer.

nickva commented Jun 20, 2024

To test job migration even when the local shard doesn't get updated, I used an N=5 db and then stopped one of the nodes where the job wasn't running. I then saw the job correctly migrate to one of the remaining nodes.

I used this silly script to create the replication job:

#!/bin/bash
# make && ./dev/run --admin=adm:pass -n 5 --with-haproxy

S=localhost:5984
SA=adm:pass@${S}
JSON="Content-Type: application/json"

curl -s -XDELETE $SA/t > /dev/null
curl -s -XDELETE $SA/s > /dev/null
curl -s -XDELETE $SA/_replicator > /dev/null

sleep 1
echo "Creating _replicator db"
curl -f -s -XPUT $SA/_replicator'?n=5' > /dev/null

sleep 1
echo "Creating src and tgt dbs"
curl -f -s -XPUT $SA/t > /dev/null
curl -f -s -XPUT $SA/s > /dev/null

echo "Posting replication job"
curl -f -XPUT $SA/_replicator/r -H "$JSON" \
 -d '{"source":"http://adm:pass@localhost:5984/s", "target":"http://adm:pass@localhost:5984/t", "worker_processes":1, "continuous":true}'

Cluster started with:

make && ./dev/run --admin=adm:pass -n 5 --with-haproxy

As a result, my replication job ended up running on NODE4. I then stopped NODE2, which changed the hashing used in the ownership algorithm, and the new owner became NODE1:

NODE4
Replication doc shards/00000000-7fffffff/_replicator.1718859537:r with id {"5fe95cdd2e5718789c106b986a2d8340","+continuous"} usurped by node '[email protected]'
couch_replicator_scheduler: Job {"5fe95cdd2e5718789c106b986a2d8340","+continuous"} stopped as <0.19718.0>

NODE2
([email protected])1> init:stop().
([email protected])2> *** ERROR: Shell process terminated! (^G to start new job) ***

NODE1
Node '[email protected]' is the usurper for doc shards/00000000-7fffffff/_replicator.1718859537:r with id nil
couch_replicator_scheduler: Job {"5fe95cdd2e5718789c106b986a2d8340","+continuous"} started as <0.21919.0>
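The ownership change in the logs above can be reproduced with a toy model. Assuming, for illustration only, that the owner is chosen by hashing the replication doc over the sorted list of live nodes (CouchDB's actual owner function differs in detail; `toy_owner` below is entirely hypothetical), removing any node can remap a doc to a new owner even when the removed node wasn't running the job, which is exactly the A-to-C migration pattern this PR fixes:

```python
import hashlib

def toy_owner(live_nodes, db_name, doc_id):
    # Hypothetical: deterministically pick an owner by hashing the
    # (db, doc) pair over the sorted list of currently live nodes.
    # CouchDB's real owner function differs; this only shows the shape.
    nodes = sorted(live_nodes)
    h = int(hashlib.md5(f"{db_name}/{doc_id}".encode()).hexdigest(), 16)
    return nodes[h % len(nodes)]

nodes = ["node1", "node2", "node3", "node4", "node5"]
before = toy_owner(nodes, "_replicator", "r")
after = toy_owner([n for n in nodes if n != "node2"], "_replicator", "r")
# `before` and `after` may differ even if "node2" was not the owner,
# because the modulus over the node list changed for every doc.
```

Any node in the cluster can therefore become the new owner after a membership change, which is why every node must retain records for shards it could potentially own.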

jaydoane (Contributor) left a review:
Way to fix a nasty corner case, and also reduce the amount of code, and complexity at the same time. Brilliant work!

@nickva nickva force-pushed the fix-replicator-job-migration-on-rebalancing branch from 2f68d18 to 9c701a6 Compare June 21, 2024 15:47
@nickva nickva merged commit a563d05 into main Jun 21, 2024
17 checks passed
@nickva nickva deleted the fix-replicator-job-migration-on-rebalancing branch June 21, 2024 16:28