
Make multidb changes shard map aware #4962

Merged: 2 commits into main, Jan 18, 2024
Conversation

@nickva (Contributor) commented Jan 13, 2024

The couch_multidb_changes module monitors shards whose names match a particular suffix and notifies users with found, updated, and deleted events. This is the module that drives replicator jobs when `*/_replicator` databases are updated.
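For illustration, a suffix check might look roughly like this (the helper is hypothetical, not the module's actual code; shard file names follow the `shards/<range>/<db name>.<timestamp>` convention):

```erlang
%% Hypothetical helper: does a shard file name like
%% <<"shards/00000000-7fffffff/foo/_replicator.1705000000">> end in the
%% monitored <<"_replicator">> suffix (ignoring the trailing timestamp)?
matches_suffix(ShardName, Suffix) when is_binary(ShardName), is_binary(Suffix) ->
    %% Strip the ".<timestamp>" part, then compare the tail of the name.
    [NoTs | _] = binary:split(ShardName, <<".">>),
    Size = byte_size(NoTs),
    SufSize = byte_size(Suffix),
    Size >= SufSize andalso
        binary:part(NoTs, Size - SufSize, SufSize) =:= Suffix.
```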

Previously, couch_multidb_changes reacted only to node-local shard file events and was not aware of the shard map membership of those files. This was mostly evident during shard moves: the target shard could be created long before the shard file became a part of the shard map. The replicator could notice the new target shard file and spawn a replication job on the new node while keeping the same replication job running on the source node. The two replication jobs would eventually conflict in the PG system (https://www.erlang.org/doc/man/pg.html) and one of them would start crashing with a "duplicate job" error. This could last for days, depending on how long it took to populate the data on the target. Even after recovery, the target shard could be backed off for up to another 8 hours before it might run again.
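For context, here is a self-contained sketch (module, scope, and group names are hypothetical, not the replicator's actual ones) of how pg makes such duplicates visible:

```erlang
-module(pg_dup_sketch).
-export([start/0, join_or_fail/1]).

start() ->
    %% A pg scope is a named process; it must be running before any joins.
    pg:start_link(rep_jobs).

%% Each job joins a group keyed by its replication id. If the group
%% already has another member (possibly on another node), this job is a
%% duplicate and should not run.
join_or_fail(RepId) ->
    ok = pg:join(rep_jobs, RepId, self()),
    case pg:get_members(rep_jobs, RepId) -- [self()] of
        [] -> ok;
        [_ | _] = Others -> {error, {duplicate_job, Others}}
    end.
```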

To avoid issues like that, make couch_multidb_changes aware of shard map membership updates. When a shard file is discovered and it is not in the shard map, mark it with a `wait_shard_map = true` flag. Then, re-use the existing db event monitoring mechanism to notice when the shards db itself is updated, and schedule a delayed membership check for the shards tracked in our ETS table.
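A minimal sketch of that flow (the state record, the exact mem3 lookup, and `notify_found/1` are assumptions; they stand in for what the PR actually does):

```erlang
-module(shard_map_flow_sketch).
-export([handle_db_found/2, handle_shards_db_updated/1, check_waiting/1]).

%% #shard{} comes from mem3's include file.
-include_lib("mem3/include/mem3.hrl").

-record(st, {tid}).                               % assumed server state
-record(row, {db_name, wait_shard_map = false}).  % trimmed; fuller sketch below

%% Is this node-local shard file currently a member of the shard map?
%% (Using mem3:dbname/1 and mem3:local_shards/1 here is an assumption;
%% the PR's exact lookup may differ.)
shard_in_map(ShardName) ->
    DbName = mem3:dbname(ShardName),
    lists:member(ShardName, [N || #shard{name = N} <- mem3:local_shards(DbName)]).

%% On discovery: track the shard, flagging it if it isn't in the map yet.
%% The ETS table is assumed to use {keypos, #row.db_name}.
handle_db_found(ShardName, #st{tid = Tid} = St) ->
    Row = #row{db_name = ShardName, wait_shard_map = not shard_in_map(ShardName)},
    true = ets:insert(Tid, Row),
    St.

%% When the shards db itself is updated, schedule a delayed membership
%% re-check for the waiting rows (the delay value is illustrative).
handle_shards_db_updated(St) ->
    erlang:send_after(1000, self(), check_shard_map),
    St.

%% The delayed check: any waiting row that made it into the shard map is
%% reported (notify_found/1 is a hypothetical stand-in for the callback).
check_waiting(#st{tid = Tid}) ->
    Waiting = ets:match_object(Tid, #row{wait_shard_map = true, _ = '_'}),
    [notify_found(R) || R <- Waiting, shard_in_map(R#row.db_name)].

notify_found(#row{db_name = Name}) ->
    io:format("shard ~s joined the shard map~n", [Name]).
```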

Other changes to the module are mostly cosmetic:

  • Remove the unused created callback. db_found is used instead, both when dbs are created and during startup when they are discovered.

  • In the ETS table, use a proper #row{} record since we now have 5 items in the tuple. This simplifies some of the existing code as well (see the sketch after this list).

  • During deletion and creation, actually delete the entries from the ETS table. Previously we didn't do it, so they would hang around forever until the node was restarted.

  • Add comments to a few tricky sections explaining what should be happening there.

  • Add more tests covering both the old and new functionality. Increase coverage from 96% to 98%.
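For illustration, the record might look roughly like this (only db_name and wait_shard_map are named in the PR text; the other fields are guesses at the remaining tuple items):

```erlang
%% keypos must point at db_name so the record works as an ETS key.
-record(row, {
    db_name,                 % shard file name, the ETS key
    changes_pid,             % assumed: pid of the changes feed worker
    rescan = false,          % assumed: rescan requested while running
    wait_shard_map = false,  % from the PR: not yet in the shard map
    last_seq = 0             % hypothetical fifth field
}).

init_table() ->
    ets:new(?MODULE, [set, named_table, {keypos, #row.db_name}]).
```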

This callback is not used. We rely on the db_found callback to discover databases. That happens both when they are created and when they are found during the initial scan on startup.
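The commit message above implies the remaining callback surface; a sketch for reference (only the callback names come from the PR and commit message, the specs are assumptions):

```erlang
-module(couch_multidb_changes_sketch).

%% db_found replaces the removed created callback: it fires both when a
%% matching db is created and when one is found during the startup scan.
-callback db_found(DbName :: binary(), Context :: term()) -> term().
-callback db_deleted(DbName :: binary(), Context :: term()) -> term().
-callback db_change(DbName :: binary(), Change :: term(), Context :: term()) -> term().
```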
@nickva force-pushed the make-multidb-changes-shard-map-aware branch from 0ac55d9 to 3f027d2 on January 16, 2024 21:05
@jaydoane (Contributor) left a comment


Very nice cleanup on some super complex code!

@nickva force-pushed the make-multidb-changes-shard-map-aware branch from 3f027d2 to e4bd63c on January 17, 2024 16:18
@nickva force-pushed the make-multidb-changes-shard-map-aware branch from e4bd63c to 6fdbbb7 on January 17, 2024 21:31
@nickva (Contributor, Author) commented Jan 17, 2024

After manually trying a few shard moves locally, the default scheduler interval seemed a bit long. It would be nice to react a bit quicker to shard map changes on average, so I decided to use `[skip_ddocs, {shards_db_check_msec, Interval div 2}]` instead of `[skip_ddocs, {dbs_check_msec, Interval}]`. This way the shard map check happens at least once before each scheduling interval.
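A hypothetical sketch of what that option list could look like at the call site (the start_link argument order, the doc-processor module name, and the config key are assumptions):

```erlang
start_changes_listener() ->
    %% Replicator scheduler interval, in milliseconds (default assumed).
    Interval = config:get_integer("replicator", "interval", 60000),
    %% Check the shard map twice per scheduling interval, so a membership
    %% change is noticed at least once before each scheduling pass.
    Opts = [skip_ddocs, {shards_db_check_msec, Interval div 2}],
    couch_multidb_changes:start_link(<<"_replicator">>,
                                     couch_replicator_doc_processor, nil, Opts).
```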

@nickva merged commit 4de4086 into main on Jan 18, 2024
15 checks passed
@nickva deleted the make-multidb-changes-shard-map-aware branch on January 18, 2024 16:35