-
Notifications
You must be signed in to change notification settings - Fork 165
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cache jitter #123
Cache jitter #123
Conversation
The auth expiration was used.
README.md
Outdated
The `auth_jitter_seconds` and `acl_jitter_seconds` allow to randomize the expiration used. The value used | ||
is cache_seconds +/- jitter_seconds. With above values (30 seconds for cache and 3 seconds for jitter), the | ||
expiration will be between 27 seconds and 33 seconds. The jitter is useful to reduce loopups storm that could | ||
occur every auth/acl_cache_seconds if lots of clients connected at the same time (for example | ||
after a server restart, all your clients may reconnect immediatly creating lots of entry expiring at the same time, unless | ||
a jitter is used). You can set jitter to 0 to disable this feature. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The `auth_jitter_seconds` and `acl_jitter_seconds` allow to randomize the expiration used. The value used | |
is cache_seconds +/- jitter_seconds. With above values (30 seconds for cache and 3 seconds for jitter), the | |
expiration will be between 27 seconds and 33 seconds. The jitter is useful to reduce loopups storm that could | |
occur every auth/acl_cache_seconds if lots of clients connected at the same time (for example | |
after a server restart, all your clients may reconnect immediatly creating lots of entry expiring at the same time, unless | |
a jitter is used). You can set jitter to 0 to disable this feature. | |
`auth_jitter_seconds` and `acl_jitter_seconds` options allow to randomize cache expiration time by a given offset | |
The value used for expiring a cache record would then be `cache_seconds` +/- `jitter_seconds`. With above values (30 seconds for cache and 3 seconds for jitter), effective expiration would yield any value between 27 and 33 seconds. | |
Setting a `jitter` value is useful to reduce lookups storms that could occur every `auth/acl_cache_seconds` if lots of clients connected at the same time, e.g. after a server restart when all clients may reconnect immediately creating lots of entries expiring at the same time. | |
You may omit or set jitter options to 0 to disable this feature. | |
Besides the suggestion, this would only be true when refreshing the cache on startup, as a non refreshed cache wouldn't suffer from a horde of clients reconnecting, right? I think trusting the user to not set jitter
when having a persistent cache is good enough, but do you think we should have some no_jitter
default in that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR try to fix the following use case:
- You had hundred of clients connected
- You restart your mosquitto (like for upgrade).
- You don't have persistent cache
- All client got disconnected. All client reconnect (immediately in my case - at least within the few next seconds)
- All clients will trigger auth & acl checks at the same time. Since no persistent cache, it will storms the auth/acl backend. We can't avoid this one (without persistent cache).
- But without jitter, the cache expiration of all entry will be at the same time. Causing another storming of the backend on entries expiration (at least if client active enough, in my case the sent every 10 seconds).
- And this will continue every acl_cache_seconds
So:
- Yes this happend only if Mosquitto is restarted and mosquitto-go-auth don't have persistent cache
- (actually it happend also if for some reason lots of client connect at the same time - and stay connected)
- It don't try to fix the first storming, only subsequent
- With persistent cache, this jitter is likely not useful.
I'm not sure about what you means by "no_jitter
default" ? Do you just want that default value of jitter are 0 ? (which means no effect)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I meant was if e.g. the cache is persistent and not set to flush on restart, then we would default to no jitter even if the users provided one. But after some thought it may not be such a good idea.
cache/cache_test.go
Outdated
// Since expirationWithJitter use random, to multiple time to ensure | ||
// result is within expected boundary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Since expirationWithJitter use random, to multiple time to ensure | |
// result is within expected boundary | |
/* Since expirationWithJitter randomizes the expirtaion time, and instructions take a random time already, | |
we need to make room for tests. | |
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to understand both your and my sentence :D
I likely did a typo and wanted to write "Since expirationWithJitter use random, DO multiple time [...]" or "Since expirationWithJitter use random, test multiple time [...]".
What I wanted to said it that since the result is randomized, we can't test for exact value but only that result is within a range. Doing multiple tests ensure that all results are within the range.
What about this version ?:
// Since expirationWithJitter randomizes the expirtaion time, do test multiple times and check that result is within expected range
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, gotcha! Yeah, the room
I mentioned was more about testing to the limit since we don't know how much the instructions take. I get what you were saying now and your new comment makes sense to me.
go-auth.go
Outdated
@@ -374,6 +374,8 @@ func setCache(authOpts map[string]string) { | |||
|
|||
var aclCacheSeconds int64 = 30 | |||
var authCacheSeconds int64 = 30 | |||
var authJitterSeconds int64 = 3 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As almost always, this one's on me: jitters
are only known to cache
, there's no need to have them processed in the main
package. Then again, acl/authCacheSeconds
shouldn't be there either, but refactoring cache
to process the option and extract the duration is totally out of scope for this PR, so leave it in main
and we can take care of it later.
TL;DR: never mind this comment. And note to self: don't start open-source projects when you've only used the language for a couple of weeks, you'll regret it many years later when making/reviewing changes (and have a desire to burn the whole ugly, bad-designed thing down while we are at it). 🤦♀️
Co-authored-by: Ignacio Gómez <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost done, just one fix and a discussion on default jitters on parsing error.
if err == nil { | ||
authJitterSeconds = authSec | ||
} else { | ||
log.Warningf("couldn't parse authJitterSeconds (err: %s), defaulting to %d", err, authJitterSeconds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of odd because I would assume that passing a wrong value would default to the same as omitting the option, i.e. 0, not 3. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Either I'm missing something or I don't understand the first part of your comment.
For me (and my understanding of the code), the value used if option is omitted or in case of wrong value is 3.
Because in both case, we don't reach line 392 (authJitterSeconds = authSec
) so we use using the default value set few line above.
I do agree that both omitted and wrong value must use the same default/fallback value.
I don't mind whether this default is 0 or 3. I've choose a default of jitter enabled because I think it only had benefit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, you're right that both are setting it to 3, I got thrown off by the docs which state that not passing a jitter would prevent the feature, i.e. the value should be 0, no 3. And if we're setting the default to 0, then I'd expect a parse error to default to 0 too.
Based on the principle of least surprise and if we wish to honor the docs, I think it makes sense to not have a jitter when not explicitly set by the user, and thus the default should be 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the default to 0.
if err == nil { | ||
aclJitterSeconds = aclSec | ||
} else { | ||
log.Warningf("couldn't parse aclJitterSeconds (err: %s), defaulting to %d", err, aclJitterSeconds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
Co-authored-by: Ignacio Gómez <[email protected]>
Port of jpmens/mosquitto-auth-plug#315
Add a jitter to the expiration delay, which avoid lots of entry to expire at the same time. This could easily occur when the Mosquitto server is restarted: all clients get disconnected and will all reconnect quickly. This will cause a storm of lookups (can't avoid the first one, at least not without Redis). But without any jitter, every acl_cacheseconds a new lookup storm will happen.
This PR is based on #122 because otherwise a conflict will occur.