Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for Polars structs #10943

Merged
merged 6 commits into from
Nov 10, 2023
Merged

Conversation

ayax79
Copy link
Contributor

@ayax79 ayax79 commented Nov 4, 2023

Provides support for reading Polars structs. This allows opening of supported files (jsonl, parquet, etc) that contain rows with structured data.

The following attached json lines file(receipts.jsonl.gz) contains a customer column with structured data. This json lines file can now be loaded via dfr open and will render as follows:

Screenshot 2023-11-09 at 10 09 18

This also addresses some cleanup of date handling and utilizing timezones where provided.

This pull request only addresses reading data from polars structs. I will address converting nushell data to polars structs in a future request as this change is large enough as it is.

@fdncred
Copy link
Collaborator

fdncred commented Nov 4, 2023

Oh, this looks interesting. I'm excited to see where it goes.

@ayax79 ayax79 marked this pull request as ready for review November 9, 2023 02:18
@fdncred
Copy link
Collaborator

fdncred commented Nov 9, 2023

Would you mind adding some examples of how this would be used, not in the code necessarily but just in the description? Maybe this is just for opening things like parquet files that support all the polars data types?

Also, is this ready to land after review?

@ayax79
Copy link
Contributor Author

ayax79 commented Nov 9, 2023

Would you mind adding some examples of how this would be used, not in the code necessarily but just in the description? Maybe this is just for opening things like parquet files that support all the polars data types?

Also, is this ready to land after review?

Yes, this is ready for review. I updated the description. Does that work?

Copy link
Collaborator

@fdncred fdncred left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested and it works as expected. I'm good with these changes.

@fdncred fdncred merged commit fe92051 into nushell:main Nov 10, 2023
20 checks passed
@fdncred
Copy link
Collaborator

fdncred commented Nov 10, 2023

Thanks!

@ayax79 ayax79 deleted the dataframe-struct-support branch November 25, 2023 00:49
Comment on lines +1043 to +1061
#[inline]
fn arr_to_value(
dt: &DataType,
arr: &dyn Array,
idx: usize,
span: Span,
) -> Result<Value, ShellError> {
macro_rules! downcast {
($casttype:ident) => {{
let arr = &*(arr as *const dyn Array as *const $casttype);
arr.value_unchecked(idx)
}};
}

// Not loving the unsafe here, however this largely based off the one
// example I found for converting Array values in:
// polars_core::chunked_array::ops::any_value::arr_to_any_value
unsafe {
match dt {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the safety condition depends on the correct DataType being passed, this function itself has to be unsafe.

Even if we declare it as such I really find it hard to validate that this whole operation is sound.
The call sites of this function need to guarantee both a valid type is ascribed and that all requirements with regards to the pointer and index are met.

Beyond that I am not super comfortable with a manul type reinterpretation that depends on the data representation as given to us by an external crate in this case both polars_core and arrow2. The former has been quite willing to break APIs so it feels to me like a ticking time bomb to make safety assumptions across crates outside of a simpler well documented interface.
If I have to read the crate source code to reassure me that this is sound and can't just read the documentation of a constrained unsafe API I think this is just asking for future trouble.

sholderbach added a commit that referenced this pull request Nov 27, 2023
sholderbach added a commit that referenced this pull request Nov 29, 2023
Reverts #10943

The current implementation of `arr_to_value` is unsound, as it allows
casts of arbitrary data to arbitrary types without being marked
`unsafe`.
The full safety requirements to perform both the cast and the following
unchecked access are not as clear that a simple change of `fn
arr_to_value` to `unsafe fn arr_to_value` could be blessed without
further investigation.

cc @ayax79
hardfau1t pushed a commit to hardfau1t/nushell that referenced this pull request Dec 14, 2023
Provides support for reading Polars structs. This allows opening of
supported files (jsonl, parquet, etc) that contain rows with structured
data.

The following attached json lines
file([receipts.jsonl.gz](https://github.com/nushell/nushell/files/13311476/receipts.jsonl.gz))
contains a customer column with structured data. This json lines file
can now be loaded via `dfr open` and will render as follows:

<img width="525" alt="Screenshot 2023-11-09 at 10 09 18"
src="https://github.com/nushell/nushell/assets/56345/4b26ccdc-c230-43ae-a8d5-8af88a1b72de">


This also addresses some cleanup of date handling and utilizing
timezones where provided.

This pull request only addresses reading data from polars structs. I
will address converting nushell data to polars structs in a future
request as this change is large enough as it is.

---------

Co-authored-by: Jack Wright <[email protected]>
hardfau1t pushed a commit to hardfau1t/nushell that referenced this pull request Dec 14, 2023
Reverts nushell#10943

The current implementation of `arr_to_value` is unsound, as it allows
casts of arbitrary data to arbitrary types without being marked
`unsafe`.
The full safety requirements to perform both the cast and the following
unchecked access are not as clear that a simple change of `fn
arr_to_value` to `unsafe fn arr_to_value` could be blessed without
further investigation.

cc @ayax79
dmatos2012 pushed a commit to dmatos2012/nushell that referenced this pull request Feb 20, 2024
Provides support for reading Polars structs. This allows opening of
supported files (jsonl, parquet, etc) that contain rows with structured
data.

The following attached json lines
file([receipts.jsonl.gz](https://github.com/nushell/nushell/files/13311476/receipts.jsonl.gz))
contains a customer column with structured data. This json lines file
can now be loaded via `dfr open` and will render as follows:

<img width="525" alt="Screenshot 2023-11-09 at 10 09 18"
src="https://github.com/nushell/nushell/assets/56345/4b26ccdc-c230-43ae-a8d5-8af88a1b72de">


This also addresses some cleanup of date handling and utilizing
timezones where provided.

This pull request only addresses reading data from polars structs. I
will address converting nushell data to polars structs in a future
request as this change is large enough as it is.

---------

Co-authored-by: Jack Wright <[email protected]>
dmatos2012 pushed a commit to dmatos2012/nushell that referenced this pull request Feb 20, 2024
Reverts nushell#10943

The current implementation of `arr_to_value` is unsound, as it allows
casts of arbitrary data to arbitrary types without being marked
`unsafe`.
The full safety requirements to perform both the cast and the following
unchecked access are not as clear that a simple change of `fn
arr_to_value` to `unsafe fn arr_to_value` could be blessed without
further investigation.

cc @ayax79
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants