Should object store retry on connection reset by peer? #5378

kszlim · 2024-02-09T06:50:29Z

Which part is this question about
Object store

Describe your question
Should object store's retry logic also retry when the connection is reset by a peer?

Additional context
I'm querying an S3 bucket via polars (which uses object store under the hood) and I'm encountering the error below. It only happens when I'm querying many files (~10k).

Generic S3 error: Error after 0 retries in 221.182308ms, max_retries:10, retry_timeout:10s, source:error sending request for url (https://s3.us-west-2.amazonaws.com/bucket/some_parquet_id.parquet): connection error: Connection reset by peer (os error 104)

This seems to be because this error isn't covered under object store's retry policy.

It'd be very handy if it were, though I'm not certain it would be strictly correct behavior (i.e., should it be handled by the caller of object store instead, which in this case is polars?).
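For the caller-side option, a minimal sketch of how a caller might detect this condition is to walk the error's `source()` chain and look for an `std::io::Error` of kind `ConnectionReset`. `RequestError` below is a hypothetical stand-in for a client-level error, and the sketch assumes the underlying I/O error is actually reachable through `source()`, which is not confirmed for the error in this report.

```rust
use std::error::Error;
use std::fmt;
use std::io;

/// Hypothetical wrapper standing in for a client-level error that keeps
/// the underlying I/O error behind `source()`.
#[derive(Debug)]
struct RequestError {
    source: io::Error,
}

impl fmt::Display for RequestError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "error sending request")
    }
}

impl Error for RequestError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&self.source)
    }
}

/// Returns true if any error in the `source()` chain is an `io::Error`
/// whose kind is `ConnectionReset`.
fn is_connection_reset(err: &(dyn Error + 'static)) -> bool {
    let mut current: Option<&(dyn Error + 'static)> = Some(err);
    while let Some(e) = current {
        if let Some(io_err) = e.downcast_ref::<io::Error>() {
            if io_err.kind() == io::ErrorKind::ConnectionReset {
                return true;
            }
        }
        current = e.source();
    }
    false
}
```

A caller could use a check like this to decide whether to reissue a request, independent of whatever retry policy the store applies internally.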

tustvold · 2024-02-09T12:14:19Z

We probably could retry, but connection reset normally means you are hitting rate limits and should reduce the amount of concurrent IO you are performing. There is a LimitStore that might achieve this, if polars can't do this itself.
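To illustrate the idea of capping concurrent IO, here is a minimal std-only sketch. It is not how LimitStore is implemented; `Semaphore` and `run_limited` are hypothetical names, and the sleep is a stand-in for a real GET request.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Condvar, Mutex};
use std::thread;
use std::time::Duration;

/// Minimal counting semaphore (hypothetical helper, std only).
struct Semaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), available: Condvar::new() }
    }

    /// Blocks until a permit is free, then takes it.
    fn acquire(&self) {
        let mut permits = self.permits.lock().unwrap();
        while *permits == 0 {
            permits = self.available.wait(permits).unwrap();
        }
        *permits -= 1;
    }

    /// Returns a permit and wakes one waiting thread.
    fn release(&self) {
        *self.permits.lock().unwrap() += 1;
        self.available.notify_one();
    }
}

/// Runs `total` simulated requests with at most `limit` in flight and
/// returns the highest concurrency actually observed.
fn run_limited(total: usize, limit: usize) -> usize {
    let sem = Arc::new(Semaphore::new(limit));
    let in_flight = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..total)
        .map(|_| {
            let (sem, in_flight, peak) = (sem.clone(), in_flight.clone(), peak.clone());
            thread::spawn(move || {
                sem.acquire();
                let now = in_flight.fetch_add(1, Ordering::SeqCst) + 1;
                peak.fetch_max(now, Ordering::SeqCst);
                thread::sleep(Duration::from_millis(5)); // stand-in for a GET
                in_flight.fetch_sub(1, Ordering::SeqCst);
                sem.release();
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}
```

Whatever the number of queued requests, the observed peak never exceeds `limit`, which is the property a concurrency-limiting wrapper is meant to provide.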

kszlim · 2024-02-09T18:25:10Z

> means you are hitting rate limits and should reduce the amount of concurrent IO you are performing. There is a LimitStore

Hmm, I've tried reducing concurrency in polars (to 32) and it doesn't seem to fix this. I also doubt I'm anywhere near S3's stated rate limit of 5500 GET requests per second per partitioned prefix.

If it's not incorrect behavior, do you think it makes sense to also build this condition into the retries done by object store @tustvold ?

kszlim · 2024-02-10T01:33:35Z

Do you think it'd be alright to modify:
https://github.com/apache/arrow-rs/blob/master/object_store/src/client/retry.rs#L266

to:
!(e.is_parse() || e.is_parse_status() || e.is_parse_too_large() || e.is_user() || e.is_canceled())

Note the negation.

This would be much more aggressive about which errors get retried. Ideally the error Kind in hyper would be publicly exposed to allow deeper introspection into failure modes that can be retried, but that doesn't seem like it's going to be on the roadmap for a while (see: hyperium/hyper#2845).

Whilst this might be more aggressive than desired, I think users could adjust their retries/backoff config accordingly to compensate. What do you think?
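The difference between the two policies can be sketched with a toy model. `ErrorClass` is a hypothetical enum standing in for the classes hyper distinguishes, with `Other` covering kinds that have no public predicate. The allow-list shown is an assumption about the shape of the existing check, not a quote of it, and the backoff schedule is a plain capped exponential without jitter, which may differ from what object store does.

```rust
use std::time::Duration;

/// Hypothetical stand-in for the error classes hyper distinguishes.
/// `Other` models kinds with no public `is_*` predicate.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq)]
enum ErrorClass {
    Parse,
    ParseStatus,
    ParseTooLarge,
    User,
    Canceled,
    Connect,
    Closed,
    IncompleteMessage,
    Other,
}

/// Allow-list policy (assumed shape of the existing check):
/// retry only on a few known-transient classes.
fn allow_list_retry(class: ErrorClass) -> bool {
    matches!(
        class,
        ErrorClass::Connect | ErrorClass::Closed | ErrorClass::IncompleteMessage
    )
}

/// Deny-list policy from the proposal: retry on everything except
/// classes that are clearly not transient. Note the negation.
fn deny_list_retry(class: ErrorClass) -> bool {
    !matches!(
        class,
        ErrorClass::Parse
            | ErrorClass::ParseStatus
            | ErrorClass::ParseTooLarge
            | ErrorClass::User
            | ErrorClass::Canceled
    )
}

/// Capped exponential backoff without jitter (illustrative only):
/// init, init * base, init * base^2, ... each capped at `max`.
fn backoff_schedule(init: Duration, base: u32, max: Duration, retries: u32) -> Vec<Duration> {
    let mut delays = Vec::new();
    let mut next = init;
    for _ in 0..retries {
        delays.push(next);
        next = (next * base).min(max);
    }
    delays
}
```

Under this model, an error in the `Other` class is skipped by the allow-list but retried by the deny-list, so the retry count and backoff settings become the main lever for bounding how long a persistent failure is retried.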

kszlim · 2024-02-10T18:52:39Z

Opened PR #5383; feel free to close it if you think it's unreasonable.

tustvold · 2024-03-06T03:37:12Z

The conclusion of pola-rs/polars#14598 appears to be that this was an upstream issue, so closing this. Feel free to reopen if I am mistaken

kszlim added the question label Feb 9, 2024

This was referenced Feb 10, 2024

Enable retries on connection reset by peer when doing scan_parquet against an object store pola-rs/polars#14384

Closed

Enable retries on wider classes of errors in object_store #5383

Closed

wjones127 mentioned this issue Feb 19, 2024

Connection reset by peer when uploading to S3 with image column lancedb/lance#1948

Open

tustvold closed this as not planned Mar 6, 2024
