Skip to content
This repository has been archived by the owner on Mar 5, 2024. It is now read-only.

Allow custom session lifetime #41

Merged
merged 1 commit into from Mar 9, 2018
Merged

Allow custom session lifetime #41

merged 1 commit into from Mar 9, 2018

Conversation

ghost
Copy link

@ghost ghost commented Mar 1, 2018

Previous hardcoded session lifetime of 15m caused issues with libraries
like botocore, which have a hard set limit of refreshing token a set time
before expiry. In botocore the default happens to be 15 minutes, which
means that the session token is never cached within botocore.

This causes excess load on kiam and quite often leads to missing access
keys on the service using the tokens.

This PR adds a flag to define the lifetime of the token that is
requested from AWS API.

@pingles
Copy link
Contributor

pingles commented Mar 1, 2018

Thanks for adding this! I'm going to add a few remarks to your changes if that's ok- keen to have it merged but a couple of tweaks to make please.

@ghost
Copy link
Author

ghost commented Mar 2, 2018

Sure thing. It was made in a bit of a haste since this was devastating our infra.. So now that everything is stable again, there is time to make modifications 😄

@@ -54,6 +54,7 @@ func main() {
kingpin.Flag("role-base-arn", "Base ARN for roles. e.g. arn:aws:iam::123456789:role/").StringVar(&serverConfig.RoleBaseARN)
kingpin.Flag("role-base-arn-autodetect", "Use EC2 metadata service to detect ARN prefix.").BoolVar(&serverConfig.AutoDetectBaseARN)
kingpin.Flag("session", "Session name used when creating STS Tokens.").Default("kiam").StringVar(&serverConfig.SessionName)
kingpin.Flag("session-lifetime", "Requested session duration for STS Tokens.").Default("15m").DurationVar(&serverConfig.SessionDuration)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rename the flag to session-duration given it's typed as a duration and the other variable names use that (I prefer duration too :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, lets use duration here too. The reason why I initially opted for lifetime is that duration sounds like it's measuring something rather than requesting something. In the sake of consistency, I agree that it should be duration.

DefaultCredentialsValidityPeriod = 15 * time.Minute
DefaultCacheTTL = 10 * time.Minute
DefaultPurgeInterval = 1 * time.Minute
DefaultCacheTTL = 10 * time.Minute
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the DefaultCacheTTL should now be defined relative to the cache duration. We want to make sure it's expired from the cache a number of minutes before the session duration- that way the cache has a chance to fetch new credentials ahead of the client needing them

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It probably just needs to be something like CacheTTL - (2 * time.Minute)

Copy link
Author

@ghost ghost Mar 8, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually brings us to a more fundamental problem. Since we explicitly store everything at DefaultCacheTTL, what if we request a token for 5 hours and AWS returns us a token of 1 hour? This will make the cached item stay in the cache for the additional 4 hours..

The only way I can see to do this is first resolve the credentials, then store it in the cache using the returned session duration. This means that the futures concept in the caching layer becomes pretty much obsolete.

I don't understand if there is some deeper meaning for the futures and why it's actually there.. Could you give me some context so I can figure out how to proceed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point about the durations being controlled by AWS. We should probably rename most of those to be something like requestedTTL to indicate it's what we're asking, rather than the actual duration of the token.

The future is there to prevent us requesting credentials for the same role multiple times- since access to the cache is under a Mutex if multiple clients request credentials for the same role they'll be given references to the same Future, so they can all block on the same request for credentials and complete when the credentials are returned. It's somewhat of an optimisation but means that the number of AWS calls reduces significantly.

I'm not sure if there's any guarantees from AWS about the duration- if the operation can only succeed when credentials are returned with the requested TTL then I'd leave things mostly as-is. However, if STS can disregard what you request and override the duration (so the expiry on the returned response is the only thing that matters) then we'll probably need to re-jig things a bit.

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also be interested to hear @Joseph-Irving thoughts on this :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just had a discussion here about the caching and during that discussion I realised that it's actually better if we keep the cache TTL 10 minutes. That will spread out the token renewals for the pods themselves.

Let's assume that we renew the token 2 minutes before the token expiration. this means that for 13 minutes all botocore based applications will refresh tokens constantly prior to refresh. Even if that isn't the case, this will cause a storm of requests that all coincide with the token expiration within kiam, effectively causing spikes in load for kiam. Refreshing cached tokens every 10 minutes will still reduce the load on AWS API quite significantly, while avoiding request storms.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe you can safely set the cache, retrieve and then set again- the mutex applies within those functions and not around them so we'd need other synchronisation to happen in your example code above.

Is your suggestion then that we'd keep all cache expiries/refreshes to 10 minutes, but that credentials would be valid for longer- so clients would refresh them less frequently against kiam but the kiam server would be interacting with AWS every 10 minutes?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cache does support a Replace operation (and could probably be updated to support modifying an item's expiry) that has the correct synchronisation.

To summarise, it feels like the options are:

  1. Change the cache to allow expiries to be updated/changed (either with a change to the lib, or using the Replace function. Cache expiry would be set to Expiry - (time.Minute * jitter) to ensure its refreshed ahead of client's requesting credentials.
  2. Allow sessions to have configurable durations but maintain a fixed refresh interval (via the cache expiration) of 10 minutes.

Perhaps the safest thing for now is option 2 (which is pretty much what you originally contributed via this PR). It means that --session-duration can't be used to reduce load on the AWS API but would allow clients to need credentials less frequently. 1 is probably better but I'd want to be sure that it was safe under load.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm currently in favour of option 2. My original concern wasn't to reduce load on AWS api, but rather on kiam (and the system doing API calls through kiam). The problem we encountered was to do with latency. Every request received by the botocore backed python app caused 4 HTTP requests to be triggered to request instance credentials. Increasing the session duration caused the caching within botocore to hold the keys in memory and reduced the overhead from credential fetching.

I'll update the PR with the naming changes only for now.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also a related issue was with credential fetching where kiam was unable to return credentials to the app, but I'll try and figure out the root cause for this later on.

Previous hardcoded session lifetime of 15m caused issues with libraries
like botocore, which have a hard set limit of refreshing token a set time
before expiry. In botocore the default happens to be 15 minutes, which
means that the session token is never cached within botocore.

This causes excess load on kiam and quite often leads to missing access
keys on the service using the tokens.

This commit adds a flag to define the lifetime of the token that is
requested from AWS API.
@pingles pingles merged commit 0e23bcb into uswitch:master Mar 9, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant