Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add godo's rate limiter configuration & retryable http client #546 #967

Conversation

DanielHLelis
Copy link
Contributor

@DanielHLelis DanielHLelis commented Mar 28, 2023

Hello, as showed in the #546 issue, for medium to large scale projects it is very common to hit the API rate limits, with in some projects it even preventing the execution of Terraform plan at all.

Here I have implementations of two proposed solutions, one implementing a configuration present in the digitalocean/godo client and other from @tback, as shown in the #820 PR.

The first one is using the Static Rate Limit from godo, adding the configuration option to the schema, applying it in the GodoClient. I also have set the default value to 4 req/s (240 req/m) which is just under the API's 250 req/m limit.

The second one is configuring godo to use the Hashicorp's go-retryablehttp, adding the configuration options to the schema (for max tries, min wait and max wait), with a default max tries of 0 (which keeps the default behaviour of no retries), min wait of 1s (which is the package's default) and max wait of 30s (which is also the package's default). In order to avoid confusion during configuration and keep things simpler, I've set the wait arguments to be floats indicating the time in seconds.

Both solutions are present in separate branches of my fork, feature/godo-rate-limiter and feature/retryable-http, with this PR being a merge of both.

Looking foward to help on resolving this issue and open to help on any adjustments.

@DanielHLelis
Copy link
Contributor Author

Fixed linting issues: imports at digitalocean/config/config.go.

@DanielHLelis DanielHLelis force-pushed the feature/godo-rate-limiter-and-http-retryable branch from 9501b6f to e67505b Compare March 28, 2023 18:02
@DanielHLelis
Copy link
Contributor Author

Just noticed that I've accidentally commited the built binnary to git. Made a force push to remove it from history.

@DanielHLelis
Copy link
Contributor Author

FYI: I've removed the default rate limit of 4 req/s since it would slow down the plan of smaller projects that don't reach the rate limit often, it is more useful for large projects that always reach the limit, in other cases the retryable approach feels like a better solution.

Still haven't enabled it by default, but I think it could be a good option, doing something like max 12 tries, min wait of 5s and max wait of 10s, as defaults would cover rate limit cases pretty well without prejudicing much the rarer occasions of server errors (5xx).

@mikeberezowski
Copy link

Thank you for this submission. Our team needs time to review this and review the godo changes needed. We want to ensure we're providing the best experience for our users before we implement backoffs and retries.

@lattwood
Copy link

lattwood commented Apr 5, 2023

@mikeberezowski this issue represents a hard blocker with incredibly unwieldy workarounds, is there any timeline for the review on this, and the review on the godo changes, or would we be better served by reaching out to our Account Executive for that information?

@DanielHLelis
Copy link
Contributor Author

DanielHLelis commented Apr 5, 2023

I've pushed this fork to the Terraform public registry (mainly for personal use, since it's impacting my current setup). Also just updated the docs for the newly added arguments (had a little coherence issue which I haven't notice at first).

Also, as said before, this is a join of two different fixes, if you want I can open a new PR with only one of these (personally, I feel like the retryable option is the best, since it doesn't impact small projects in any way when enabled).

I'm open to help and contribute in any improvements/fixes you find suitable.

By the way, there was no updates to godo itself, I've used a configuration option for throttling that was already available in the client itself. For the retryable, the same thing applies, since godo allows you to pass the desired HTTP client on the constructor.

@andrewsomething
Copy link
Member

@DanielHLelis Thank you so much for putting this pull request together. I'll be giving this a closer review, but after some conversation inside the team we are OK with the general approach here. I think that the ideal long-term solution is to implement something at the godo level as it has more knowledge of the DigitalOcean API and will be generally useful, but it is clear that there is demand for this from the community and we do not have a solid timeline for addressing it in godo.

I agree with leaving this as an opt-in for the moment. This will allow the community to validate the approach, and provide feedback that we can use to eventually enable a solution by default. The configuration options are likely things we would want to expose in any solution we use (http_retry_max, http_retry_wait_min, http_retry_wait_max). So we should be able to transparently replace the solution if needed without breaking existing user configs.

One notable drawback to hashicorp/go-retryablehttp is that it does not understand the ratelimit-reset header used by the DigitalOcean API. It only looks for Retry-After for its backoff implementation. So in some situations, rate limiting will still impact users. Though it should still help to smooth out requests that hit the burst limit of 5% of the hourly total per minute.

We do have some manual reties in the Terraform codebase. Before enabling it by default, something we'd want to better understand is how this interacts with existing user configurable timeouts.

Thanks again!

@DanielHLelis
Copy link
Contributor Author

DanielHLelis commented Apr 5, 2023

Thanks for the response. About the RateLimit-Reset header, it is possible to make the hashicorp/go-retryablehttp understand the header. To do that, we could overwrite the default backoff strategy with a custom one that parses the header, it would look something like this:

func digitalOceanAPIBackoff(min, max time.Duration, attemptNum int, resp *http.Response) time.Duration {
	if resp != nil && resp.StatusCode == http.StatusTooManyRequests {
		// Retrieve API's Rate Limit Reset unix timestamp
		if s, ok := resp.Header["Ratelimit-Reset"]; ok {
			if resetUnix, err := strconv.ParseInt(s[0], 10, 64); err == nil {
				nowUnix := time.Now().Unix()
				sleep := time.Second * time.Duration(resetUnix-nowUnix)

				log.Printf("[INFO] Reached API Rate Limit, waiting: %s seconds", sleep)

				// Cap sleep time to maximum configured value (to prevent too long wait times for mismatched clocks)
				if sleep > max {
					return max
				}

				// Avoid negative sleep times (generally indicates a mismatched clock)
				if sleep > 0 {
					return sleep
				}
			}
		}
	}

	// Fallback to default backoff strategy
	sleep := retryablehttp.DefaultBackoff(min, max, attemptNum, resp)
	log.Printf("[INFO] API Error (not Rate Limit), waiting: %s seconds", sleep)
	return sleep
}

Then we could configure the client to use it by simply overwriting the Backoff attribute like so:

	...
	retryableClient.RetryWaitMax = time.Duration(c.HTTPRetryWaitMax * float64(time.Second))
	retryableClient.Backoff = digitalOceanAPIBackoff // HERE, replaces the default backoff

	client := retryableClient.StandardClient()
	...

The problem with that is the fact that the header returns the epoch timestamp of the first request to expire in the last 1 hour period, making it unsuitable to track the necessary wait time for the triggers at the 1 minute limits. Here is the branch with a demo of the beforementioned code, feature/retryable-http-custom-backoff, testing it I confirmed this behavior.

About the interaction with other retries inside the code, the retryable client only attempt retries on some client-side errors, HTTP 429 and HTTP 5XX (except for 501). Because of that, it shouldn't interact with any other retries present inside the code, as those usually are done in the 4XX range (mainly 412). I also haven't found retries inside the godo code during the quick look I took at it (so I am not 100% sure about that).

digitalocean/config/config.go Outdated Show resolved Hide resolved
digitalocean/provider.go Show resolved Hide resolved
@andrewsomething
Copy link
Member

andrewsomething commented Apr 6, 2023

The problem with that is the fact that the header returns the epoch timestamp of the first request to expire in the last 1 hour period, making it unsuitable to track the necessary wait time for the triggers at the 1 minute limits.

Right. That's part of why there is a somewhat larger project needed on the DigitalOcean side beyond just clients around all of this.

I've kicked the tires on this a bit now, and it is working great to smooth out requests and preventing Terraform from choking on the burst limiting. Let's get this landed! I left a couple small questions/suggestions in line.

@DanielHLelis
Copy link
Contributor Author

DanielHLelis commented Apr 6, 2023

I've implemented the first suggestions about completely avoiding the retryable http client when http_retry_max is not greater than 0.

When it comes to the second suggestion about the floats, I didn't implement it yet. Since Terraform will accept integers on float fields, casting them automatically, it shouldn't be an issue on the user side.
If you still prefer it to be an int, just send me a reply so that I change it here.

@benyanke
Copy link

benyanke commented Apr 8, 2023

Do you think it might make sense to read responses from the headers, to guarantee you stay under the limit?

https://docs.digitalocean.com/reference/api/intro/#rate-limit

you can easily see, by reading the right header, how many requests are left before hitting limits. This could enable the user to hit the max allowable apply speed.

@DanielHLelis
Copy link
Contributor Author

Do you think it might make sense to read responses from the headers, to guarantee you stay under the limit?

https://docs.digitalocean.com/reference/api/intro/#rate-limit

you can easily see, by reading the right header, how many requests are left before hitting limits. This could enable the user to hit the max allowable apply speed.

I've tried it, as suggested by @andrewsomething, (there is even a demo up here in the PR), the problem is that those refer to the overall 1-hour rate limit. The most common issue when running the provider is reaching the burst limit of 250 req/m. For that we'd need different headers conveying the information of the burst limit, which aren't currently available.

Overall, the current solution works out well to smooth out the requests and handle the rate limits, with the retryable approach being the most "optimal" in the sense of overall execution times and the throttle being able to ensure consistency (avoiding completely the rate limit) at the cost of increased execution times.

It can be later adapted to support the new headers if and when they become available.

@benyanke
Copy link

That's a solid point - guess there's not much that can be done without the short-term burst limit.

I like this work overall - I do hope it gets merged. It would make this provider usable again for me.

@andrewsomething andrewsomething self-requested a review April 11, 2023 14:45
@andrewsomething andrewsomething merged commit f66b7e9 into digitalocean:main Apr 11, 2023
@andrewsomething
Copy link
Member

@DanielHLelis Thanks again for all your work on this! We'll be looking to get a release cut with this soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants