Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache jitter #123

Merged
merged 7 commits into from
Mar 7, 2021
Merged

Cache jitter #123

merged 7 commits into from
Mar 7, 2021

Conversation

PierreF
Copy link
Contributor

@PierreF PierreF commented Nov 13, 2020

Port of jpmens/mosquitto-auth-plug#315

Add a jitter to the expiration delay, which avoid lots of entry to expire at the same time. This could easily occur when the Mosquitto server is restarted: all clients get disconnected and will all reconnect quickly. This will cause a storm of lookups (can't avoid the first one, at least not without Redis). But without any jitter, every acl_cacheseconds a new lookup storm will happen.

This PR is based on #122 because otherwise a conflict will occur.

README.md Outdated
Comment on lines 278 to 284
The `auth_jitter_seconds` and `acl_jitter_seconds` allow to randomize the expiration used. The value used
is cache_seconds +/- jitter_seconds. With above values (30 seconds for cache and 3 seconds for jitter), the
expiration will be between 27 seconds and 33 seconds. The jitter is useful to reduce loopups storm that could
occur every auth/acl_cache_seconds if lots of clients connected at the same time (for example
after a server restart, all your clients may reconnect immediatly creating lots of entry expiring at the same time, unless
a jitter is used). You can set jitter to 0 to disable this feature.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The `auth_jitter_seconds` and `acl_jitter_seconds` allow to randomize the expiration used. The value used
is cache_seconds +/- jitter_seconds. With above values (30 seconds for cache and 3 seconds for jitter), the
expiration will be between 27 seconds and 33 seconds. The jitter is useful to reduce loopups storm that could
occur every auth/acl_cache_seconds if lots of clients connected at the same time (for example
after a server restart, all your clients may reconnect immediatly creating lots of entry expiring at the same time, unless
a jitter is used). You can set jitter to 0 to disable this feature.
`auth_jitter_seconds` and `acl_jitter_seconds` options allow to randomize cache expiration time by a given offset
The value used for expiring a cache record would then be `cache_seconds` +/- `jitter_seconds`. With above values (30 seconds for cache and 3 seconds for jitter), effective expiration would yield any value between 27 and 33 seconds.
Setting a `jitter` value is useful to reduce lookups storms that could occur every `auth/acl_cache_seconds` if lots of clients connected at the same time, e.g. after a server restart when all clients may reconnect immediately creating lots of entries expiring at the same time.
You may omit or set jitter options to 0 to disable this feature.

Besides the suggestion, this would only be true when refreshing the cache on startup, as a non refreshed cache wouldn't suffer from a horde of clients reconnecting, right? I think trusting the user to not set jitter when having a persistent cache is good enough, but do you think we should have some no_jitter default in that case?

Copy link
Contributor Author

@PierreF PierreF Mar 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR try to fix the following use case:

  • You had hundred of clients connected
  • You restart your mosquitto (like for upgrade).
  • You don't have persistent cache
  • All client got disconnected. All client reconnect (immediately in my case - at least within the few next seconds)
  • All clients will trigger auth & acl checks at the same time. Since no persistent cache, it will storms the auth/acl backend. We can't avoid this one (without persistent cache).
  • But without jitter, the cache expiration of all entry will be at the same time. Causing another storming of the backend on entries expiration (at least if client active enough, in my case the sent every 10 seconds).
  • And this will continue every acl_cache_seconds

So:

  • Yes this happend only if Mosquitto is restarted and mosquitto-go-auth don't have persistent cache
  • (actually it happend also if for some reason lots of client connect at the same time - and stay connected)
  • It don't try to fix the first storming, only subsequent
  • With persistent cache, this jitter is likely not useful.

I'm not sure about what you means by "no_jitter default" ? Do you just want that default value of jitter are 0 ? (which means no effect)

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was if e.g. the cache is persistent and not set to flush on restart, then we would default to no jitter even if the users provided one. But after some thought it may not be such a good idea.

cache/cache.go Outdated Show resolved Hide resolved
cache/cache.go Show resolved Hide resolved
cache/cache.go Show resolved Hide resolved
cache/cache_test.go Outdated Show resolved Hide resolved
Comment on lines 12 to 13
// Since expirationWithJitter use random, to multiple time to ensure
// result is within expected boundary
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Since expirationWithJitter use random, to multiple time to ensure
// result is within expected boundary
/* Since expirationWithJitter randomizes the expirtaion time, and instructions take a random time already,
we need to make room for tests.
*/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to understand both your and my sentence :D
I likely did a typo and wanted to write "Since expirationWithJitter use random, DO multiple time [...]" or "Since expirationWithJitter use random, test multiple time [...]".

What I wanted to said it that since the result is randomized, we can't test for exact value but only that result is within a range. Doing multiple tests ensure that all results are within the range.

What about this version ?:

// Since expirationWithJitter randomizes the expirtaion time, do test multiple times and check that result is within expected range

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, gotcha! Yeah, the room I mentioned was more about testing to the limit since we don't know how much the instructions take. I get what you were saying now and your new comment makes sense to me.

cache/cache_test.go Outdated Show resolved Hide resolved
go-auth.go Outdated
@@ -374,6 +374,8 @@ func setCache(authOpts map[string]string) {

var aclCacheSeconds int64 = 30
var authCacheSeconds int64 = 30
var authJitterSeconds int64 = 3
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As almost always, this one's on me: jitters are only known to cache, there's no need to have them processed in the main package. Then again, acl/authCacheSeconds shouldn't be there either, but refactoring cache to process the option and extract the duration is totally out of scope for this PR, so leave it in main and we can take care of it later.

TL;DR: never mind this comment. And note to self: don't start open-source projects when you've only used the language for a couple of weeks, you'll regret it many years later when making/reviewing changes (and have a desire to burn the whole ugly, bad-designed thing down while we are at it). 🤦‍♀️

Copy link
Owner

@iegomez iegomez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost done, just one fix and a discussion on default jitters on parsing error.

cache/cache.go Outdated Show resolved Hide resolved
if err == nil {
authJitterSeconds = authSec
} else {
log.Warningf("couldn't parse authJitterSeconds (err: %s), defaulting to %d", err, authJitterSeconds)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of odd because I would assume that passing a wrong value would default to the same as omitting the option, i.e. 0, not 3. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either I'm missing something or I don't understand the first part of your comment.
For me (and my understanding of the code), the value used if option is omitted or in case of wrong value is 3.
Because in both case, we don't reach line 392 (authJitterSeconds = authSec) so we use using the default value set few line above.

I do agree that both omitted and wrong value must use the same default/fallback value.

I don't mind whether this default is 0 or 3. I've choose a default of jitter enabled because I think it only had benefit.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, you're right that both are setting it to 3, I got thrown off by the docs which state that not passing a jitter would prevent the feature, i.e. the value should be 0, no 3. And if we're setting the default to 0, then I'd expect a parse error to default to 0 too.
Based on the principle of least surprise and if we wish to honor the docs, I think it makes sense to not have a jitter when not explicitly set by the user, and thus the default should be 0.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the default to 0.

if err == nil {
aclJitterSeconds = aclSec
} else {
log.Warningf("couldn't parse aclJitterSeconds (err: %s), defaulting to %d", err, aclJitterSeconds)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@PierreF PierreF merged commit 82ca3fc into iegomez:master Mar 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants