Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Raise error when Ray Data tensor cannot be pickled and disable compliant nested types #2428

Merged
merged 12 commits into from
Jun 22, 2024

Conversation

kevinzwang
Copy link
Member

@kevinzwang kevinzwang commented Jun 21, 2024

This solves some of our flaky unit tests that try to read a Ray dataset tensor type under a Pyarrow version that does not properly serialize it. Adds a new exception when we detect this issue, and also updates our pyarrow dependencies to 13.0.0.

I also found a separate bug upon upgrading to pyarrow, where versions >= 13.0.0 will use compliant Parquet nested type names by default (see apache/arrow#35146), which would then cause a discrepancy between the Parquet and Arrow schemas for extension types. This would lead to incorrect conversion from Parquet to Arrow when we would read a file. This PR disables that explicitly.

Finally I also added a coerce_temporal_nanoseconds parameter to to_pandas to revert it to its pre pyarrow>=13.0.0 behavior, which makes the sql integration tests pass again

Confirmed to raise proper error in local testing

@github-actions github-actions bot added the bug Something isn't working label Jun 21, 2024
@kevinzwang kevinzwang changed the title [BUG] Raise better error when Ray Data tensor cannot be pickled [BUG] Raise better error when Ray Data tensor cannot be pickled and use compliant Parquet nested types Jun 21, 2024
@kevinzwang kevinzwang changed the title [BUG] Raise better error when Ray Data tensor cannot be pickled and use compliant Parquet nested types [BUG] Detect when Ray Data tensor cannot be pickled and use compliant Parquet nested types Jun 21, 2024
@kevinzwang kevinzwang changed the title [BUG] Detect when Ray Data tensor cannot be pickled and use compliant Parquet nested types [BUG] Detect when Ray Data tensor cannot be pickled and disable compliant nested types Jun 21, 2024
@kevinzwang kevinzwang changed the title [BUG] Detect when Ray Data tensor cannot be pickled and disable compliant nested types [BUG] Raise error when Ray Data tensor cannot be pickled and disable compliant nested types Jun 21, 2024
Copy link

codecov bot commented Jun 21, 2024

Codecov Report

Attention: Patch coverage is 75.00000% with 6 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@4951842). Learn more about missing BASE report.

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #2428   +/-   ##
=======================================
  Coverage        ?   63.03%           
=======================================
  Files           ?      939           
  Lines           ?   108079           
  Branches        ?        0           
=======================================
  Hits            ?    68125           
  Misses          ?    39954           
  Partials        ?        0           
Files Coverage Δ
daft/runners/partitioning.py 81.33% <ø> (ø)
daft/table/micropartition.py 91.07% <100.00%> (ø)
daft/table/table.py 58.17% <100.00%> (ø)
daft/table/table_io.py 88.99% <100.00%> (ø)
daft/dataframe/dataframe.py 88.80% <40.00%> (ø)

Copy link
Contributor

@jaychia jaychia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@kevinzwang kevinzwang merged commit b7e6e41 into main Jun 22, 2024
47 checks passed
@kevinzwang kevinzwang deleted the kevin/ray-data-pyarrow-serialization branch June 22, 2024 00:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants