
Enable retries on wider classes of errors in object_store #5383

Closed
wants to merge 1 commit

Conversation

@kszlim (Contributor) commented Feb 10, 2024

Which issue does this PR close?

Closes #.

Rationale for this change

When the connection is reset by the peer, retries are skipped and the request fails. We should probably enable retries on these errors so that object_store is more robust when used for OLAP queries. We do not have access to the underlying Kind of the hyper error (https://github.com/hyperium/hyper/blob/0.14.x/src/error.rs#L19), so this is the best available way to widen the set of cases we retry on. This might be overly aggressive, though it shouldn't be unsafe.

What changes are included in this PR?

Now we retry by default unless the error is one of:

  • a parse error
  • a user error
  • a cancelled request

Are there any user-facing changes?

No API changes, but it will retry on a strict superset of the errors it used to retry on.

@@ -263,7 +262,7 @@ impl RetryExt for reqwest::RequestBuilder {
                 do_retry = true
             } else if let Some(source) = e.source() {
                 if let Some(e) = source.downcast_ref::<hyper::Error>() {
-                    if e.is_connect() || e.is_closed() || e.is_incomplete_message() {
+                    if !(e.is_parse() || e.is_parse_status() || e.is_parse_too_large() || e.is_user() || e.is_canceled()) {
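To spell out the semantics of the one-line change above: the old condition is an allowlist of known-transient connection failures, while the new one is a denylist that retries everything except clearly client-side errors. A paraphrased sketch (not the actual object_store code), using only the hyper 0.14 Error methods that appear in the diff:

```rust
// Paraphrased from the diff above; `e` is the hyper::Error recovered via
// source().downcast_ref::<hyper::Error>().

// Before this PR: allowlist, retry only known-transient connection failures.
fn retried_before(e: &hyper::Error) -> bool {
    e.is_connect() || e.is_closed() || e.is_incomplete_message()
}

// After this PR: denylist, retry anything that is not clearly a client-side
// problem (malformed response, user error, or cancellation).
fn retried_after(e: &hyper::Error) -> bool {
    !(e.is_parse()
        || e.is_parse_status()
        || e.is_parse_too_large()
        || e.is_user()
        || e.is_canceled())
}
```

Every error the old predicate retried is also retried by the new one, which is why the description above calls the new set a strict superset.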
Contributor
Can this be made an allowlist? We should only retry errors we know are safe to retry; otherwise we could end up retrying until we time out.

@kszlim (Contributor, Author) commented Feb 11, 2024

You mean like it was previously? The problem I'm encountering is that there are no conditions exposed in hyper for https://docs.rs/hyper/latest/src/hyper/error.rs.html#478.

I totally agree with you that this code is much more aggressive, though there seems to be no other way.

I think what can justify this approach is that we can't retry indefinitely, since the retry limit is set by the caller of this code (a sketch of the caller-side configuration follows at the end of this comment). I agree it's not ideal, but it seems it will be quite a while before hyper exposes more fine-grained information about errors.

I'm happy to hear any other suggestions though.

Does it make sense to have two tiers of errors: one with a lower retry limit for the unknown kinds of errors caught by the conditions I added, and one with a higher limit for the errors the original code caught? It's definitely uglier for consumers, though.

Otherwise it might make sense for polars to implement RetryExt, and perhaps to gate this RetryExt implementation behind an optional feature. @ritchie46, curious what your thoughts are; I have this PR open to resolve pola-rs/polars#14384, which I almost invariably encounter when querying a large enough dataset in S3 (from an EC2 instance).
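For context on the retry limit being set by the caller, here is a minimal sketch of the caller-side configuration, assuming the object_store 0.9-era builder API; the bucket name and the specific limits are placeholders:

```rust
use std::time::Duration;
use object_store::{aws::AmazonS3Builder, RetryConfig};

fn build_s3() -> object_store::Result<object_store::aws::AmazonS3> {
    // Even with a broader retry predicate, both the number of attempts and
    // the total retry window stay bounded by whatever the caller sets here.
    let mut retry = RetryConfig::default();
    retry.max_retries = 5;                         // placeholder: give up after 5 retries
    retry.retry_timeout = Duration::from_secs(60); // placeholder: or after 60s of retrying

    AmazonS3Builder::from_env()
        .with_bucket_name("my-bucket") // placeholder bucket
        .with_retry(retry)
        .build()
}
```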

@tustvold (Contributor) commented Feb 11, 2024

I can't help feeling that if hyper doesn't expose this error variant, it perhaps isn't one we should be handling... I do wonder if the issue is that polars is being too aggressive with IO; have you tried using LimitStore?
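For reference, wiring a store through LimitStore looks roughly like this; a minimal sketch assuming the object_store 0.9-era API, with the bucket name and the limit of 16 as placeholders:

```rust
use object_store::{aws::AmazonS3Builder, limit::LimitStore, ObjectStore};

fn build_limited_store() -> object_store::Result<impl ObjectStore> {
    let s3 = AmazonS3Builder::from_env()
        .with_bucket_name("my-bucket") // placeholder bucket
        .build()?;
    // LimitStore caps the number of concurrent requests to the inner store,
    // here at 16, providing back-pressure at the store level.
    Ok(LimitStore::new(s3, 16))
}
```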

Contributor

We use a tokio semaphore with a budget of 10.
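(Not the actual polars code, just an illustration of the pattern described: a shared tokio semaphore that keeps at most `budget` requests in flight.)

```rust
use std::sync::Arc;
use tokio::sync::Semaphore;

async fn fetch_all(keys: Vec<String>, budget: usize) {
    // One permit per in-flight request; with budget = 10, at most ten
    // downloads run concurrently.
    let semaphore = Arc::new(Semaphore::new(budget));
    let mut handles = Vec::new();

    for key in keys {
        let permit = Arc::clone(&semaphore).acquire_owned().await.unwrap();
        handles.push(tokio::spawn(async move {
            // ... issue the object_store request for `key` here ...
            let _ = key;
            drop(permit); // release the slot once the request finishes
        }));
    }

    for handle in handles {
        handle.await.unwrap();
    }
}
```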

Contributor

Could you provide a link to how this is wired up? Is it possible the semaphore is stalling in-flight IO?

@tustvold (Contributor) commented Feb 12, 2024

Let's wait for the potential fix in polars, so that we have more information on the nature of the issue you are running into. If the fix works, the issue is contention-related, and retrying is entirely the wrong thing to do.

Contributor

Here is the upstream PR that should restrict concurrency further: pola-rs/polars#14435

@kszlim (Contributor, Author) commented Feb 12, 2024

I'm curious why you think it's a rate-limit issue; I thought S3 should return a 503 on rate limiting, as documented here:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html

Based on:
https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance-design-patterns.html#optimizing-performance-timeouts-retries (though this might also imply that object_store should have a configurable retry strategy on a per-cloud-provider basis?)

It seems like these errors are probably caused by an underlying issue in object_store, hyper, reqwest, or polars' usage of object_store?

Contributor

That's only if your request actually gets to S3 and you then trip its limits. If you flood the network, e.g. by creating hundreds of TCP connections to handle hundreds of concurrent HTTP/1.1 requests, there are various components in the network that may just start dropping these connections on the floor.


> I can't help feeling if hyper doesn't expose this error variant, it perhaps isn't one we should be handling.

Perhaps @seanmonstar could clarify why those error kind types aren't exposed?

I'd agree with that feeling. Anything accessed through .source() of a hyper error is not guaranteed (we may change internal dependencies). Exposing specific error classes is just something we haven't really designed (help welcome).

(No opinion otherwise on this PR, beyond the general caution about retrying unknown errors, especially if the request could have had side effects on the server.)

@tustvold (Contributor)

Marking this as a draft pending further investigation.

@tustvold (Contributor) commented Mar 6, 2024

The conclusion of pola-rs/polars#14598 appears to be that this was an upstream issue, so I am closing this. Feel free to reopen if I am mistaken.

@tustvold closed this Mar 6, 2024
Labels
object-store (Object Store Interface)

Successfully merging this pull request may close these issues:

  • Should object store retry on connection reset by peer?

4 participants