
Enable retries on wider classes of errors in object_store #5383

Closed
wants to merge 1 commit

Conversation

@kszlim (Contributor) commented Feb 10, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

When the connection is reset by the peer, retries are skipped and the request fails. We should probably enable retries on these errors so that object_store is more robust when used for OLAP queries. We do not have access to the underlying Kind of the hyper error (https://github.com/hyperium/hyper/blob/0.14.x/src/error.rs#L19), so this is the best available way to widen the set of cases we retry on. This might be overly aggressive, though it shouldn't be unsafe.

What changes are included in this PR?

Now we retry by default unless the error is one of:

  • a parse error
  • a user error
  • a cancelled request

Are there any user-facing changes?

No API changes, but it will retry on a strict superset of the errors it used to retry on.

@@ -263,7 +262,7 @@ impl RetryExt for reqwest::RequestBuilder {
                 do_retry = true
             } else if let Some(source) = e.source() {
                 if let Some(e) = source.downcast_ref::<hyper::Error>() {
-                    if e.is_connect() || e.is_closed() || e.is_incomplete_message() {
+                    if !(e.is_parse() || e.is_parse_status() || e.is_parse_too_large() || e.is_user() || e.is_canceled()) {
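To spell out the semantics of the one-line change above: the old condition is an allowlist of known-transient connection failures, while the new one is a denylist that retries everything except clearly client-side errors. A paraphrased sketch (not the actual object_store code), using only the hyper 0.14 Error methods that appear in the diff:

```rust
// Paraphrased from the diff above; `e` is the hyper::Error recovered via
// source().downcast_ref::<hyper::Error>().

// Before this PR: allowlist, retry only known-transient connection failures.
fn retried_before(e: &hyper::Error) -> bool {
    e.is_connect() || e.is_closed() || e.is_incomplete_message()
}

// After this PR: denylist, retry anything that is not clearly a client-side
// problem (malformed response, user error, or cancellation).
fn retried_after(e: &hyper::Error) -> bool {
    !(e.is_parse()
        || e.is_parse_status()
        || e.is_parse_too_large()
        || e.is_user()
        || e.is_canceled())
}
```

Every error the old predicate retried is also retried by the new one, which is why the description above calls the new set a strict superset.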
Contributor
Can this be made an allowlist? We should only retry errors we know are safe to retry; otherwise we could end up retrying until we time out.

@kszlim (Contributor, Author) commented Feb 11, 2024

You mean like it was previously? The problem I'm encountering is that there are no conditions exposed in hyper for https://docs.rs/hyper/latest/src/hyper/error.rs.html#478.

I totally agree with you that this code is much more aggressive, though there seems to be no other way.

I think what can justify this approach is that we can't retry indefinitely, since the retry limit is set by the caller of this code (a sketch of the caller-side configuration follows at the end of this comment). I agree it's not ideal, but it seems it will be quite a while before hyper exposes more fine-grained information about errors.

I'm happy to hear any other suggestions though.

Does it make sense to have two tiers of errors: one with a lower retry limit for the unknown kinds of errors caught by the conditions I added, and one with a higher limit for the errors the original code caught? It's definitely uglier for consumers, though.

Otherwise it might make sense for polars to implement RetryExt, and perhaps to gate this RetryExt implementation behind an optional feature. @ritchie46, curious what your thoughts are; I have this PR open to resolve pola-rs/polars#14384, which I almost invariably encounter when querying a large enough dataset in S3 (from an EC2 instance).
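For context on the retry limit being set by the caller, here is a minimal sketch of the caller-side configuration, assuming the object_store 0.9-era builder API; the bucket name and the specific limits are placeholders:

```rust
use std::time::Duration;
use object_store::{aws::AmazonS3Builder, RetryConfig};

fn build_s3() -> object_store::Result<object_store::aws::AmazonS3> {
    // Even with a broader retry predicate, both the number of attempts and
    // the total retry window stay bounded by whatever the caller sets here.
    let mut retry = RetryConfig::default();
    retry.max_retries = 5;                         // placeholder: give up after 5 retries
    retry.retry_timeout = Duration::from_secs(60); // placeholder: or after 60s of retrying

    AmazonS3Builder::from_env()
        .with_bucket_name("my-bucket") // placeholder bucket
        .with_retry(retry)
        .build()
}
```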

@tustvold (Contributor) commented Feb 11, 2024

I can't help feeling that if hyper doesn't expose this error variant, it perhaps isn't one we should be handling... I do wonder if the issue is that polars is being too aggressive with IO; have you tried using LimitStore?
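For reference, wiring a store through LimitStore looks roughly like this; a minimal sketch assuming the object_store 0.9-era API, with the bucket name and the limit of 16 as placeholders:

```rust
use object_store::{aws::AmazonS3Builder, limit::LimitStore, ObjectStore};

fn build_limited_store() -> object_store::Result<impl ObjectStore> {
    let s3 = AmazonS3Builder::from_env()
        .with_bucket_name("my-bucket") // placeholder bucket
        .build()?;
    // LimitStore caps the number of concurrent requests to the inner store,
    // here at 16, providing back-pressure at the store level.
    Ok(LimitStore::new(s3, 16))
}
```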

Contributor

We use a tokio semaphore with a budget of 10.
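(Not the actual polars code, just an illustration of the pattern described: a shared tokio semaphore that keeps at most `budget` requests in flight.)

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn fetch_all(keys: Vec<String>, budget: usize) {
    // One permit per in-flight request; with budget = 10, at most ten
    // downloads run concurrently.
    let semaphore = Arc::new(Semaphore::new(budget));
    let mut handles = Vec::new();

    for key in keys {
        let permit = Arc::clone(&semaphore).acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            // ... issue the object_store request for `key` here ...
            let _ = key;
            drop(permit); // release the slot once the request finishes
        }));
    }

    for handle in handles {
        handle.await.unwrap();
    }
}
```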

Contributor

Could you provide a link to how this is wired up? Is it possible the semaphore is stalling in-flight IO?

@tustvold (Contributor) commented Feb 12, 2024

Let's wait for the potential fix in polars, so that we have more information on the nature of the issue you are running into. If the fix works, the issue is contention-related, and retrying is entirely the wrong thing to do.

Contributor

Here is the upstream PR that should restrict concurrency further: pola-rs/polars#14435

@kszlim (Contributor, Author) commented Feb 12, 2024

I'm curious why you think it's a rate-limit issue; I thought S3 should return a 503 on rate limiting, as documented here:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

Based on:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-timeouts-retries (though this might also imply that object_store should have a configurable retry strategy on a per-cloud-provider basis?)

It seems like these errors are probably caused by an underlying issue in object_store, hyper, reqwest, or polars' usage of object_store?

Contributor

That's only if your request actually gets to S3 and you then trip its limits. If you flood the network, e.g. by creating hundreds of TCP connections to handle hundreds of concurrent HTTP/1.1 requests, there are various components in the network that may just start dropping these connections on the floor.


> I can't help feeling if hyper doesn't expose this error variant, it perhaps isn't one we should be handling.

Perhaps @seanmonstar could clarify why those error kind types aren't exposed?

I'd agree with that feeling. Anything accessed through .source() of a hyper error is not guaranteed (we may change internal dependencies). Exposing specific error classes is just something we haven't really designed (help welcome).

(No opinion otherwise on this PR, beyond the general caution about retrying unknown errors, especially if the request could have had side effects on the server.)

@tustvold (Contributor)

Marking this as a draft pending further investigation.

@tustvold (Contributor) commented Mar 6, 2024

The conclusion of pola-rs/polars#14598 appears to be that this was an upstream issue, so I am closing this. Feel free to reopen if I am mistaken.

@tustvold closed this Mar 6, 2024
Labels
object-store (Object Store Interface)

Successfully merging this pull request may close these issues:

  • Should object store retry on connection reset by peer?

4 participants