"source" encoding for datasets opened from fsspec objects #8923

Merged on Jun 30, 2024 (11 commits)

Conversation

keewis
Collaborator

@keewis keewis commented Apr 9, 2024

When opening files from path-like objects (str, pathlib.Path), the backend machinery (_dataset_from_backend_dataset) sets the "source" encoding. This is useful if we need the original path for additional processing, like writing to a similarly named file, or to extract additional metadata. This would be useful as well when using fsspec to open remote files.
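
As an illustration of the "similarly named file" use case, here is a minimal sketch that derives an output name from the "source" entry. It works on a plain dict standing in for `ds.encoding`; the helper name `derived_output_path`, the `_processed` suffix, and the example path are assumptions made for this example, not part of xarray.

```python
from pathlib import PurePosixPath


def derived_output_path(encoding, suffix="_processed"):
    """Build a sibling file name from the "source" entry of an encoding dict.

    `encoding` is a plain dict standing in for `ds.encoding` (assumption).
    """
    source = encoding.get("source")
    if source is None:
        raise KeyError('no "source" entry in encoding')
    path = PurePosixPath(source)
    # Keep the directory and extension, append the suffix to the stem.
    return str(path.with_name(f"{path.stem}{suffix}{path.suffix}"))


# Hypothetical path, as it might appear for a remote file.
encoding = {"source": "bucket/data/air_2024.nc"}
print(derived_output_path(encoding))  # bucket/data/air_2024_processed.nc
```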

In this PR, I'm extracting the path attribute that most fsspec objects have to set that value. I've considered using isinstance checks instead of the getattr-with-default, but the list of potential classes is too big to be practical (at least 4 classes just within fsspec itself).
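
The getattr-with-default approach could be sketched roughly as follows. This is a simplified, standard-library-only illustration, not the code in this PR: `guess_source` is a hypothetical helper, and `FakeRemoteFile` is a made-up stand-in for an fsspec file object that exposes a `path` attribute.

```python
import io
import os
from pathlib import Path


class FakeRemoteFile(io.BytesIO):
    """Hypothetical stand-in for an fsspec file object exposing `path`."""

    def __init__(self, path, data=b""):
        super().__init__(data)
        self.path = path


def guess_source(filename_or_obj):
    """Return a string for the "source" encoding, or None if unknown."""
    if isinstance(filename_or_obj, (str, os.PathLike)):
        return os.fspath(filename_or_obj)
    # Duck typing instead of isinstance checks against many classes:
    # any object with a string `path` attribute is accepted.
    path = getattr(filename_or_obj, "path", None)
    return path if isinstance(path, str) else None


print(guess_source(Path("local.nc")))                   # local.nc
print(guess_source(FakeRemoteFile("bucket/remote.nc")))  # bucket/remote.nc
print(guess_source(io.BytesIO()))                        # None
```

The default of `None` means objects without a `path` attribute simply get no "source" entry instead of raising.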

If this sounds like a good idea, I'll update the documentation of the "source" encoding to mention this feature.

@max-sixty
Collaborator

Without knowing much (I generally call ds.reset_encoding()), it does sound like a good idea!

@Illviljan
Contributor

Shouldn't _normalize_path or _find_absolute_paths be able to handle this?

@keewis
Collaborator Author

keewis commented Apr 9, 2024

the main use case is indeed to extract additional data, which you'd do immediately after open_dataset (after which you could drop the encoding).

> Shouldn't _normalize_path or _find_absolute_paths be able to handle this?

As far as I can tell, they only convert path-likes to string (which these objects are not, they are file-like, not path-like). Are you suggesting we should change that?
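
To make the path-like versus file-like distinction concrete, here is a small standard-library sketch. The helper names `is_path_like` and `is_file_like` are hypothetical, and `io.BytesIO` stands in for an opened remote file.

```python
import io
import os
from pathlib import Path


def is_path_like(obj):
    """True for objects that can be converted to a path with os.fspath."""
    return isinstance(obj, (str, os.PathLike))


def is_file_like(obj):
    """True for objects that expose a callable read(), like open files."""
    return callable(getattr(obj, "read", None))


buffer = io.BytesIO(b"data")  # stand-in for an opened remote file
print(is_path_like(Path("a.nc")), is_file_like(Path("a.nc")))  # True False
print(is_path_like(buffer), is_file_like(buffer))              # False True

# Converting a file-like object to a path fails, which is why a
# path-normalizing helper cannot recover a source string from it.
try:
    os.fspath(buffer)
except TypeError:
    print("not path-like")
```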

@dcherian
Contributor

I think this is fine, but our long-term goal is to delete encoding so you might consider a different solution to your problem.

@keewis
Collaborator Author

keewis commented Apr 23, 2024

my impression of that discussion was that we wanted to either return the encoding in a separate object, or somehow remove the encoding after the first operation (i.e. not carry it around). Either way would be fine with me, since I would still have access to it immediately after opening.

@dcherian
Contributor

Would a dataset with this in encoding be round tripped without error? Would be good to test that

@keewis
Collaborator Author

keewis commented Jun 24, 2024

> Would a dataset with this in encoding be round tripped without error? Would be good to test that

I'm not opposed to adding an explicit test (since I can't find an existing one right now), but if it caused problems we'd also have those with string paths / URLs, and those have been working just fine for a long time.

As far as I can tell, "source", as well as "original_shape", are dropped from the encoding before doing anything else (search for safe_to_drop for where that happens).
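
The dropping step can be pictured with the following simplified sketch. It only illustrates the idea of discarding informational keys before writing and is not xarray's actual implementation; the function name `strip_informational_keys` and the example encoding values are assumptions.

```python
# Keys that only describe where the data came from (names taken from the
# comment above); they carry no information needed to write the file.
SAFE_TO_DROP = {"source", "original_shape"}


def strip_informational_keys(encoding):
    """Return a copy of `encoding` without the informational keys."""
    return {k: v for k, v in encoding.items() if k not in SAFE_TO_DROP}


# Hypothetical variable encoding as it might look after opening a file.
encoding = {
    "source": "bucket/data/air_2024.nc",
    "original_shape": (10, 20),
    "dtype": "float32",
}
print(strip_informational_keys(encoding))  # {'dtype': 'float32'}
```

Because the keys are removed regardless of their values, it does not matter for writing whether "source" holds a string or some other object.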

@dcherian
Contributor

Ah thanks. My mistake, I thought we were sticking in the fsspec object, not just the path.

@keewis
Collaborator Author

keewis commented Jun 24, 2024

as far as I can tell, we could write anything in that encoding (fsspec objects, strings, or other things), and it would simply be ignored / dropped before writing.

@keewis keewis added the plan to merge Final call for comments label Jun 27, 2024
@dcherian dcherian merged commit caed274 into pydata:main Jun 30, 2024
34 checks passed
@keewis keewis deleted the fsspec-source branch July 3, 2024 22:17
Successfully merging this pull request may close these issues.

mfdataset - ds.encoding["source"] to retrieve filename not valid key GH2550 revisited
4 participants