Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ObjectStore with_url Should Handle Path #4199

Closed
tustvold opened this issue May 11, 2023 · 7 comments
Closed

ObjectStore with_url Should Handle Path #4199

tustvold opened this issue May 11, 2023 · 7 comments
Assignees
Labels
enhancement Any new improvement worthy of a entry in the changelog

Comments

@tustvold
Copy link
Contributor

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Various builders such as AmazonS3Builder, MicrosoftAzureBuilder, etc.. provide a with_url method.

However, with exception to URL patterns such as https://s3.region.amazonaws.com/bucket which encode the bucket name in the URL, they ignore the path. This is surprising and inconsistent with stores such as HttpStore and LocalFileSystem which have a built-in notion of a prefix.

This can to a certain extent be worked around with PrefixStore, but implementing this logic correctly requires duplicating the logic to understand what parts of a given URL are the prefix

Describe the solution you'd like

I would like the cloud stores to have a with_prefix option, and to populate this within with_url.

Describe alternatives you've considered

Additional context

Relates to #4047

delta-rs has some logic here to handle this, although this will misbehave for URLs where the bucket name is encoded in the path.

@tustvold tustvold added the enhancement Any new improvement worthy of a entry in the changelog label May 11, 2023
@tustvold tustvold self-assigned this May 11, 2023
@tustvold
Copy link
Contributor Author

I actually decided against this, in favor of returning the remaining path from parse_url in #4200

@tustvold tustvold closed this as not planned Won't fix, can't repro, duplicate, stale May 11, 2023
@flokli
Copy link

flokli commented Mar 9, 2024

I stumbled over this today. While parse_url returns the rest of the URL, I need to use the *Builder structs to be able to also allow configurability using environment variables etc (from_env).

Being able to get the path out of an URL is quite messy, I'd rather not reimplement this on my own. Of course I could use parse_url as a "path extractor" and just throw away the also returned Box<dyn ObjectStore> immediately, but it's a bit ugly.

I'd much prefer with_url in all *Builder structs to put the path it found into a field, and providing a function to return it from there.

@tustvold
Copy link
Contributor Author

tustvold commented Mar 9, 2024

All from_env does is look in the environment for configuration keys, you could do likewise and pass what you found to parse_url

@flokli
Copy link

flokli commented Mar 11, 2024

But then what's the point of having the more composable cloud-specific builders in first place?

The contain path parsing logic, being able to extract the leftover path is all that'd be needed to be able to use them for this usecase.

@tustvold
Copy link
Contributor Author

tustvold commented Mar 11, 2024

Perhaps we could make https://github.com/apache/arrow-rs/blob/master/object_store%2Fsrc%2Fparse.rs#L71 public?

But then what's the point of having the more composable cloud-specific builders in first place?

For the use-cases where stores aren't configured by a URL??

@flokli
Copy link

flokli commented Mar 12, 2024

Perhaps we could make https://github.com/apache/arrow-rs/blob/master/object_store%2Fsrc%2Fparse.rs#L71 public?

That'd help. We'd might still parse the URL twice, but that's less of a deal than constructing and throwing away the entire dyn ObjectStore.

But then what's the point of having the more composable cloud-specific builders in first place?

For the use-cases where stores aren't configured by a URL??

Hmmh, I think in most applications you almost definitely want to have a combination of both. s3:https://my-bucket/some-subpath works as a pretty good identifier to specify the protocol, bucket name and subpath where something is located, so that's something people exchange in a configuration file.

However, "how to get there" usually is very specific to the environment - a local developer on their local machine might have a static access key pair, or some AWS SSO config, while production workloads might use k8s and IAM roles for service accounts (env vars), or IAM roles for EC2 provided by the instance metadata server (implicit).

IMHO, there's a lot of value in object_store having the same semantics as the "official" cloud-provider SDKs, even if the users of object_store use parse_url. Which would mean, respecting cloud-provider-specific means of configuration (env vars, ambient metadata servers) by default, and at least having the real "path" part of a URL prominently accessible.

Leaving it up to the crate users to manually start parsing env vars into options passed to parse_url_opts will almost definitely lead to most users not doing that.

@tustvold
Copy link
Contributor Author

tustvold commented Mar 12, 2024

Hmmh, I think in most applications you almost definitely want to have a combination of both. s3:https://my-bucket/some-subpath works as a pretty good identifier to specify the protocol, bucket name and subpath where something is located, so that's something people exchange in a configuration file.

Whilst I agree that this is something people exchange, the challenge is when people then start creating object stores per path, instead of per bucket. This has been a frequent source of throughput issues people have run into, as connection pooling, credential caching, etc... are at the store level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

No branches or pull requests

2 participants