Apply random jitter during initial _replicator shard discovery #484
@nickva Seeing conflicts here, does this still make sense with the scheduler merged?
@wohali you're right, this will need to be updated for the scheduling replicator.
This is bringing back previous code: https://github.com/apache/couchdb/blob/884cf3e55f77ab1a5f26dc7202ce21771062eae6/src/couch_replicator_manager.erl#L940-L946

This is to avoid a stampede during startup, when a potentially large number of shards are found and change feeds have to be opened for all of them at the same time.

The average jitter value starts at 10 msec for the first shard, goes up to 1 minute for the 6000th shard, and stays clamped at 1 minute afterwards. (Note: that's the average; the range is 1 to 2 * average, as this is a uniform random distribution.)

Some sample values:
* 100 - 1 second
* 1000 - 10 seconds
* 6000 and higher - 1 minute

Jira: COUCHDB-3389
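The schedule described above can be sketched as follows. This is a minimal illustration, not the code from the linked file: the module name, the macro names, and the `range/1` helper are assumptions, and the constants are derived from the figures quoted in the description (10 msec average per shard, clamped at a 1 minute average). It uses `rand:uniform/1`, which returns an integer between 1 and its argument.

```erlang
-module(jitter_sketch).
-export([range/1, jitter/1]).

%% Assumed constants, taken from the figures in the description:
%% the average delay grows by 10 msec per shard and is clamped at 1 minute.
-define(AVG_DELAY_MSEC, 10).
-define(MAX_AVG_DELAY_MSEC, 60000).

%% Upper bound of the uniform range for the Count-th shard.
%% The bound is twice the average so that the mean of 1..Bound
%% lands (approximately) on the intended average.
range(Count) when is_integer(Count), Count >= 1 ->
    2 * min(Count * ?AVG_DELAY_MSEC, ?MAX_AVG_DELAY_MSEC).

%% Random delay in msec, uniformly distributed over 1..range(Count).
jitter(Count) ->
    rand:uniform(range(Count)).
```

With these assumed constants, `jitter_sketch:range(100)` is 2000, so the delay for the 100th shard averages about 1 second, which matches the sample values listed above.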
end.

notify_fold(DbName, {Server, DbSuffix, Count}) ->
    Jitter = jitter(Count),
    spawn_link(fun() ->
We might want to seed random somehow.
In this case we don't need to seed it because it runs in the same process. If it were running inside the individually spawned processes, we'd need to seed it.
Right. Thank you for pointing this out.
+1
This is bringing back previous code: couchdb/src/couch_replicator_manager.erl, lines 940 to 946 in 884cf3e
The rationale is the following: during shard scanning, a lot of resume_scan messages are sent back to back. This causes the replicator manager to open change feeds for all of those shards. Delaying each resume_scan message by a jitter proportional to the number of messages sent so far gives the replicator manager a chance to open some change feeds, finish processing them, and close them before newer resume_scan messages arrive.
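As a rough illustration of this rationale, the sketch below staggers a batch of messages so that later ones arrive after progressively longer random delays. It is not the replicator manager's code: the module and function names are hypothetical, the message shape `{resume_scan, Name}` is assumed, and it uses `erlang:send_after/3` rather than the `spawn_link` approach visible in the diff. The delay bound is inlined from the figures quoted in the description.

```erlang
-module(delayed_notify_sketch).
-export([notify_all/2]).

%% Schedules one {resume_scan, Name} message to Pid per name.
%% The Nth message is delayed by a random 1..min(20 * N, 120000) msec,
%% so early messages arrive almost immediately and later ones spread out.
%% Returns the list of timer references.
notify_all(Pid, Names) ->
    Counts = lists:seq(1, length(Names)),
    [erlang:send_after(jitter(N), Pid, {resume_scan, Name})
     || {N, Name} <- lists:zip(Counts, Names)].

%% Delay in msec for the Nth message (assumed constants: 10 msec average
%% per message, clamped at a 1 minute average).
jitter(N) ->
    rand:uniform(min(20 * N, 120000)).
```

Because the delays are computed in the calling process, a single random state is shared across all messages, which is the situation discussed in the review comments about seeding.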
The random delay average starts at 10 msec for the first message and rises to 1 min for the 6000th and higher.
Some sample values:
* 100 - 1 second
* 1000 - 10 seconds
* 6000 and higher - 1 minute
Jira: COUCHDB-3389