Allow custom session lifetime #41

ghost · 2018-03-01T06:33:40Z

Previous hardcoded session lifetime of 15m caused issues with libraries
like botocore, which have a hard set limit of refreshing token a set time
before expiry. In botocore the default happens to be 15 minutes, which
means that the session token is never cached within botocore.

This causes excess load on kiam and quite often leads to missing access
keys on the service using the tokens.

This PR adds a flag to define the lifetime of the token that is
requested from AWS API.

pingles · 2018-03-01T15:49:11Z

Thanks for adding this! I'm going to add a few remarks to your changes if that's ok- keen to have it merged but a couple of tweaks to make please.

ghost · 2018-03-02T02:05:03Z

Sure thing. It was made in a bit of a haste since this was devastating our infra.. So now that everything is stable again, there is time to make modifications 😄

pingles · 2018-03-01T15:50:00Z

cmd/server/main.go

@@ -54,6 +54,7 @@ func main() {
 kingpin.Flag("role-base-arn", "Base ARN for roles. e.g. arn:aws:iam::123456789:role/").StringVar(&serverConfig.RoleBaseARN)
 kingpin.Flag("role-base-arn-autodetect", "Use EC2 metadata service to detect ARN prefix.").BoolVar(&serverConfig.AutoDetectBaseARN)
 kingpin.Flag("session", "Session name used when creating STS Tokens.").Default("kiam").StringVar(&serverConfig.SessionName)
+ kingpin.Flag("session-lifetime", "Requested session duration for STS Tokens.").Default("15m").DurationVar(&serverConfig.SessionDuration)


I'd rename the flag to session-duration given it's typed as a duration and the other variable names use that (I prefer duration too :)

Ok, lets use duration here too. The reason why I initially opted for lifetime is that duration sounds like it's measuring something rather than requesting something. In the sake of consistency, I agree that it should be duration.

pingles · 2018-03-01T15:51:28Z

pkg/aws/sts/cache.go

- DefaultCredentialsValidityPeriod = 15 * time.Minute
- DefaultCacheTTL = 10 * time.Minute
+ DefaultPurgeInterval = 1 * time.Minute
+ DefaultCacheTTL = 10 * time.Minute


I think the DefaultCacheTTL should now be defined relative to the cache duration. We want to make sure it's expired from the cache a number of minutes before the session duration- that way the cache has a chance to fetch new credentials ahead of the client needing them

It probably just needs to be something like CacheTTL - (2 * time.Minute)

This actually brings us to a more fundamental problem. Since we explicitly store everything at DefaultCacheTTL, what if we request a token for 5 hours and AWS returns us a token of 1 hour? This will make the cached item stay in the cache for the additional 4 hours..

The only way I can see to do this is first resolve the credentials, then store it in the cache using the returned session duration. This means that the futures concept in the caching layer becomes pretty much obsolete.

I don't understand if there is some deeper meaning for the futures and why it's actually there.. Could you give me some context so I can figure out how to proceed?

Good point about the durations being controlled by AWS. We should probably rename most of those to be something like requestedTTL to indicate it's what we're asking, rather than the actual duration of the token.

The future is there to prevent us requesting credentials for the same role multiple times- since access to the cache is under a Mutex if multiple clients request credentials for the same role they'll be given references to the same Future, so they can all block on the same request for credentials and complete when the credentials are returned. It's somewhat of an optimisation but means that the number of AWS calls reduces significantly.

I'm not sure if there's any guarantees from AWS about the duration- if the operation can only succeed when credentials are returned with the requested TTL then I'd leave things mostly as-is. However, if STS can disregard what you request and override the duration (so the expiry on the returned response is the only thing that matters) then we'll probably need to re-jig things a bit.

What do you think?

Also be interested to hear @Joseph-Irving thoughts on this :)

We just had a discussion here about the caching and during that discussion I realised that it's actually better if we keep the cache TTL 10 minutes. That will spread out the token renewals for the pods themselves.

Let's assume that we renew the token 2 minutes before the token expiration. this means that for 13 minutes all botocore based applications will refresh tokens constantly prior to refresh. Even if that isn't the case, this will cause a storm of requests that all coincide with the token expiration within kiam, effectively causing spikes in load for kiam. Refreshing cached tokens every 10 minutes will still reduce the load on AWS API quite significantly, while avoiding request storms.

I don't believe you can safely set the cache, retrieve and then set again- the mutex applies within those functions and not around them so we'd need other synchronisation to happen in your example code above.

Is your suggestion then that we'd keep all cache expiries/refreshes to 10 minutes, but that credentials would be valid for longer- so clients would refresh them less frequently against kiam but the kiam server would be interacting with AWS every 10 minutes?

The cache does support a Replace operation (and could probably be updated to support modifying an item's expiry) that has the correct synchronisation.

To summarise, it feels like the options are:

Change the cache to allow expiries to be updated/changed (either with a change to the lib, or using the Replace function. Cache expiry would be set to Expiry - (time.Minute * jitter) to ensure its refreshed ahead of client's requesting credentials.

Allow sessions to have configurable durations but maintain a fixed refresh interval (via the cache expiration) of 10 minutes.

Perhaps the safest thing for now is option 2 (which is pretty much what you originally contributed via this PR). It means that --session-duration can't be used to reduce load on the AWS API but would allow clients to need credentials less frequently. 1 is probably better but I'd want to be sure that it was safe under load.

I'm currently in favour of option 2. My original concern wasn't to reduce load on AWS api, but rather on kiam (and the system doing API calls through kiam). The problem we encountered was to do with latency. Every request received by the botocore backed python app caused 4 HTTP requests to be triggered to request instance credentials. Increasing the session duration caused the caching within botocore to hold the keys in memory and reduced the overhead from credential fetching.

I'll update the PR with the naming changes only for now.

Also a related issue was with credential fetching where kiam was unable to return credentials to the app, but I'll try and figure out the root cause for this later on.

Previous hardcoded session lifetime of 15m caused issues with libraries like botocore, which have a hard set limit of refreshing token a set time before expiry. In botocore the default happens to be 15 minutes, which means that the session token is never cached within botocore. This causes excess load on kiam and quite often leads to missing access keys on the service using the tokens. This commit adds a flag to define the lifetime of the token that is requested from AWS API.

pingles reviewed Mar 7, 2018

View reviewed changes

pingles merged commit 0e23bcb into uswitch:master Mar 9, 2018

pingles mentioned this pull request Mar 9, 2018

Allow configurable cache TTL #44

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow custom session lifetime #41

Allow custom session lifetime #41

ghost commented Mar 1, 2018

pingles commented Mar 1, 2018

ghost commented Mar 2, 2018

pingles Mar 1, 2018

ghost Mar 8, 2018

pingles Mar 1, 2018

pingles Mar 7, 2018

ghost Mar 8, 2018 •

edited by ghost

Loading

pingles Mar 8, 2018

pingles Mar 8, 2018

ghost Mar 8, 2018

pingles Mar 8, 2018

pingles Mar 8, 2018

ghost Mar 9, 2018

ghost Mar 9, 2018

Allow custom session lifetime #41

Allow custom session lifetime #41

Conversation

ghost commented Mar 1, 2018

pingles commented Mar 1, 2018

ghost commented Mar 2, 2018

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Mar 8, 2018 • edited by ghost Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ghost Mar 8, 2018 •

edited by ghost

Loading