Apply random jitter during initial _replicator shard discovery #484

Merged
merged 1 commit into from
May 5, 2017

Contributor

@nickva nickva commented Apr 21, 2017

This is bringing back previous code:

    DbName = ?l2b(filename:rootname(RelativeFilename, ".couch")),
    Jitter = jitter(Acc),
    spawn_link(fun() ->
        timer:sleep(Jitter),
        gen_server:cast(Server, {resume_scan, DbName})
    end),
    Acc + 1

The rationale is the following: during shard scanning, a lot of resume_scan
messages are sent back to back. This causes the replicator manager to open
change feeds for all of those shards at once. Delaying each resume_scan
message by a jitter proportional to the number of messages sent so far gives
the replicator manager a chance to open some change feeds, finish processing
them, and close them before newer resume_scan messages arrive.

The average random delay starts at 10 msec for the first message and rises to 1 minute for the 6000th message and beyond.
Some sample values:

  • For 100 messages, average wait will be 1 second
  • For 1000 - 10 seconds
  • For 6000 and higher - 1 minute
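The sample values above are consistent with a delay drawn uniformly from a range that grows linearly with the message count and is capped at 2 minutes. The following is a minimal sketch under that assumption, not the merged code: the module name `jitter_sketch`, the helper `average_msec/1`, the two macros, and the use of the `rand` module are all chosen here for illustration.

```erlang
-module(jitter_sketch).
-export([jitter/1, average_msec/1]).

%% Assumed constants, picked so the averages match the sample values:
%% 10 msec average per message sent so far, range capped at 2 minutes.
-define(AVG_DELAY_MSEC, 10).
-define(MAX_DELAY_MSEC, 120000).

%% Random delay in milliseconds for the N-th message (N >= 1).
%% rand:uniform(Range) returns an integer in 1..Range, so the mean
%% delay is roughly Range / 2.
jitter(N) when is_integer(N), N >= 1 ->
    rand:uniform(range_msec(N)).

%% Approximate mean delay in milliseconds for the N-th message.
average_msec(N) when is_integer(N), N >= 1 ->
    range_msec(N) div 2.

%% Upper bound of the uniform range: grows linearly, then stays clamped.
range_msec(N) ->
    min(2 * N * ?AVG_DELAY_MSEC, ?MAX_DELAY_MSEC).
```

With these assumed constants, `jitter_sketch:average_msec(100)` is 1000, `average_msec(1000)` is 10000, and anything from 6000 upward gives 60000, which lines up with the 1 second, 10 seconds, and 1 minute figures in the list.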

Jira: COUCHDB-3389

@wohali
Member

wohali commented Apr 30, 2017

@nickva Seeing conflicts here, does this still make sense with the scheduler merged?

@nickva
Contributor Author

nickva commented May 1, 2017

@wohali you're right, this will need to be updated for the scheduling replicator.

This is bringing back previous code:

https://github.com/apache/couchdb/blob/884cf3e55f77ab1a5f26dc7202ce21771062eae6/src/couch_replicator_manager.erl#L940-L946

This is to avoid a stampede during startup, when a potentially large number
of shards are found and change feeds have to be opened for all of them at the
same time.

The average jitter value starts at 10 msec for the first shard, goes up to
1 minute for the 6000th shard, and stays clamped at 1 minute afterwards. (Note:
that's the average; the actual delay ranges from 1 msec to 2 * average, as
this is a uniform random distribution.)

Some sample values:

 * 100 - 1 second
 * 1000 - 10 seconds
 * 6000 and higher - 1 minute

Jira: COUCHDB-3389
    end.


notify_fold(DbName, {Server, DbSuffix, Count}) ->
    Jitter = jitter(Count),
    spawn_link(fun() ->
Contributor

We might want to seed random somehow.

Contributor Author

@nickva nickva May 5, 2017

In this case we don't need to seed it because it runs in the same process. If it were running inside each individually spawned process, we'd need to seed it.

Contributor

Right. Thank you for pointing this out.
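The point in this thread can be illustrated with a small sketch in which the delay is drawn in the calling process and only the sleep and the cast happen in the spawned one. This is an assumption-laden illustration rather than the merged code: the module name `notify_sketch`, the inlined constants, and the use of `rand` are hypothetical. With the older `random` module, an unseeded process starts from the same default seed, so drawing the value inside each spawned process would likely yield the same delay every time; drawing it in the long-lived caller avoids that.

```erlang
-module(notify_sketch).
-export([notify_fold/2]).

%% Fold callback sketch: schedules a delayed resume_scan for DbName and
%% returns the accumulator with the count incremented.
notify_fold(DbName, {Server, DbSuffix, Count}) ->
    %% The jitter is drawn here, in the caller, so successive calls
    %% advance one shared random state instead of starting fresh.
    Jitter = jitter(Count),
    spawn_link(fun() ->
        %% The spawned process only waits and notifies; it draws no
        %% random numbers of its own.
        timer:sleep(Jitter),
        gen_server:cast(Server, {resume_scan, DbName})
    end),
    {Server, DbSuffix, Count + 1}.

%% Assumed jitter: uniform in 1..Range, 10 msec average per item seen,
%% range capped at 2 minutes. max/2 guards against a zero count.
jitter(N) ->
    Range = max(1, min(2 * N * 10, 120000)),
    rand:uniform(Range).
```

Calling `notify_sketch:notify_fold(<<"db">>, {self(), <<"_replicator">>, 1})` from a shell returns the accumulator with the count bumped to 2, and the cast arrives in the caller's mailbox after at most 20 msec.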

Contributor

@iilyak iilyak left a comment

+1

@nickva nickva merged commit 4a63d22 into apache:master May 5, 2017
@nickva nickva deleted the couchdb-3389 branch May 5, 2017 19:36
3 participants