Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

upgrade pyarrow from 12 to 14 #2089

Merged
merged 4 commits into from
Nov 10, 2023
Merged

upgrade pyarrow from 12 to 14 #2089

merged 4 commits into from
Nov 10, 2023

Conversation

severo
Copy link
Collaborator

@severo severo commented Nov 10, 2023

We temporarily stop supporting array features until datasets fixes the issue on their side.

@severo severo requested review from albertvillanova and a team November 10, 2023 09:28
Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you. I think you forgot to update it in some of the components: api, rows,...

@codecov-commenter
Copy link

codecov-commenter commented Nov 10, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (bc8bdcb) 90.49% compared to head (e26ac1c) 87.05%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2089      +/-   ##
==========================================
- Coverage   90.49%   87.05%   -3.45%     
==========================================
  Files         248      132     -116     
  Lines       15202     5245    -9957     
==========================================
- Hits        13757     4566    -9191     
+ Misses       1445      679     -766     
Flag Coverage Δ
jobs_cache_maintenance 95.30% <ø> (ø)
jobs_mongodb_migration 86.69% <ø> (ø)
libs_libapi 83.37% <ø> (?)
libs_libcommon ?
services_admin 87.63% <ø> (ø)
services_api 87.09% <ø> (ø)
services_rows 85.49% <ø> (ø)
services_search 80.48% <ø> (ø)
services_sse-api 94.21% <ø> (ø)
services_worker ?

Flags with carried forward coverage won't be shown. Click here to find out more.

see 142 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@severo
Copy link
Collaborator Author

severo commented Nov 10, 2023

I think you forgot to update it in some of the components: api, rows,...

in fact, they were requiring pyarrow 11, which is why it was stuck to a lower version.

But... it seems like the upgrade will not be that simple: https://github.com/huggingface/datasets-server/actions/runs/6823050840/job/18556219132

Looking at it

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But... it seems like the upgrade will not be that simple

I see... I'm having a look at datasets side...

EDIT:
I wrongly understood that it may be caused by datasets, but that is not the case: in datasets we already support pyarrow-14 (I did the PR 😅 ):

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, the issue is in our viewer_utils...

@severo
Copy link
Collaborator Author

severo commented Nov 10, 2023

yes, to reproduce:

import numpy as np
from datasets import (Array2D, Dataset, Features)
feature_type = Array2D(shape=(2, 2), dtype="float32")
content = np.zeros((2, 2), dtype="float32")
features = Features({"col": feature_type})
dataset = Dataset.from_dict({"col": [content]}, features=features)

same if we create the content with

content = [[0.0, 0.0], [0.0, 0.0]]

The trace is:

/home/slesage/hf/datasets-server/libs/libcommon/.venv/lib/python3.9/site-packages/datasets/features/features.py:648: FutureWarning: pyarrow.PyExtensionType is deprecated and will refuse deserialization by default. Instead, please derive from pyarrow.ExtensionType and implement your own serialization mechanism.
  pa.PyExtensionType.__init__(self, self.storage_dtype)
/home/slesage/hf/datasets-server/libs/libcommon/.venv/lib/python3.9/site-packages/datasets/features/features.py:1661: RuntimeWarning: pickle-based deserialization of pyarrow.PyExtensionType subclasses is disabled by default; if you only ingest trusted data files, you may re-enable this using `pyarrow.PyExtensionType.set_auto_load(True)`.
In the future, Python-defined extension subclasses should derive from pyarrow.ExtensionType (not pyarrow.PyExtensionType) and implement their own serialization mechanism.

  obj = {field.name: generate_from_arrow_type(field.type) for field in pa_schema}
/home/slesage/hf/datasets-server/libs/libcommon/.venv/lib/python3.9/site-packages/datasets/features/features.py:1661: FutureWarning: pyarrow.PyExtensionType is deprecated and will refuse deserialization by default. Instead, please derive from pyarrow.ExtensionType and implement your own serialization mechanism.
  obj = {field.name: generate_from_arrow_type(field.type) for field in pa_schema}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/slesage/hf/datasets-server/libs/libcommon/.venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 924, in from_dict
    return cls(pa_table, info=info, split=split)
  File "/home/slesage/hf/datasets-server/libs/libcommon/.venv/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 693, in __init__
    inferred_features = Features.from_arrow_schema(arrow_table.schema)
  File "/home/slesage/hf/datasets-server/libs/libcommon/.venv/lib/python3.9/site-packages/datasets/features/features.py", line 1661, in from_arrow_schema
    obj = {field.name: generate_from_arrow_type(field.type) for field in pa_schema}
  File "/home/slesage/hf/datasets-server/libs/libcommon/.venv/lib/python3.9/site-packages/datasets/features/features.py", line 1661, in <dictcomp>
    obj = {field.name: generate_from_arrow_type(field.type) for field in pa_schema}
  File "/home/slesage/hf/datasets-server/libs/libcommon/.venv/lib/python3.9/site-packages/datasets/features/features.py", line 1381, in generate_from_arrow_type
    return Value(dtype=_arrow_to_datasets_dtype(pa_type))
  File "/home/slesage/hf/datasets-server/libs/libcommon/.venv/lib/python3.9/site-packages/datasets/features/features.py", line 111, in _arrow_to_datasets_dtype
    raise ValueError(f"Arrow type {arrow_type} does not have a datasets dtype equivalent.")
ValueError: Arrow type extension<arrow.py_extension_type<pyarrow.lib.UnknownExtensionType>> does not have a datasets dtype equivalent.

@severo
Copy link
Collaborator Author

severo commented Nov 10, 2023

Should I open an issue in datasets? Or am I misusing it? It's a bit too cryptic for me

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should I open an issue in datasets?

Yes, I understand now: they have disabled their PyExtensionType and we use it in datasets for arrays...

@severo
Copy link
Collaborator Author

severo commented Nov 10, 2023

OK, I opened huggingface/datasets#6396

@severo severo added the blocked-by-upstream The issue must be fixed in a dependency label Nov 10, 2023
severo added a commit that referenced this pull request Nov 10, 2023
@severo
Copy link
Collaborator Author

severo commented Nov 10, 2023

Once merged, we'll have to revert 0f58abb if #2090 is merged before.

Copy link
Member

@lhoestq lhoestq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reactivity !
I'm fine with disabling ArrayND types for now using 14.0.1

@severo
Copy link
Collaborator Author

severo commented Nov 10, 2023

I'm fine with disabling ArrayND types for now using 14.0.1

what does it mean? letting the error propagate, and remove it from the tests?

@lhoestq
Copy link
Member

lhoestq commented Nov 10, 2023

Let's comment the failing tests for now. I guess pyarrow will ship a fix that introduces safe extension arrays soon

@lhoestq
Copy link
Member

lhoestq commented Nov 10, 2023

(or maybe they won't, but in this case datasets should stop using PyExtensionType, let's discuss in the datasets issue)

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So we are no longer supporting datasets with arrays features...

@lhoestq
Copy link
Member

lhoestq commented Nov 10, 2023

Yes until datasets fixes it

@severo
Copy link
Collaborator Author

severo commented Nov 10, 2023

Exactly. I think that it's better to remove support for array features than to let the vulnerability.

@severo severo merged commit 8a6c727 into main Nov 10, 2023
22 checks passed
@severo severo deleted the upgrade-pyarrow branch November 10, 2023 13:34
@severo
Copy link
Collaborator Author

severo commented Nov 10, 2023

Thanks for the comments and reviews!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants