Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pyarrow: Check compatibility of pyarrow-backed pandas objects with numeric dtypes #2774

Merged
merged 16 commits into from
Dec 16, 2023

Conversation

weiji14
Copy link
Member

@weiji14 weiji14 commented Oct 29, 2023

Description of proposed changes

Pandas is moving towards using PyArrow backed arrays (previously just numpy), and will require pyarrow in pandas 3.0, so we should ensure that pyarrow dtypes work across PyGMT.

This Pull Request installs pyarrow as an optional dependency, and updates some unit tests to check both numpy and pyarrow-backed dtypes for pandas.Series and pandas.DataFrame objects.

TODO:

  • Check regular numeric dtypes (uint/int/float)
  • [ ] Check null and bool types ?
  • [ ] Check string dtypes
  • [ ] Check datetime dtypes

Relates to #1318 (comment), part of #2800

Reminders

  • Run make format and make check to make sure the code follows the style guide.
  • Add tests for new features or tests that would have caught the bug that you're fixing.
  • Add new public functions/methods/classes to doc/api/index.rst.
  • Write detailed docstrings for all functions/methods.
  • If wrapping a new module, open a 'Wrap new GMT module' issue and submit reasonably-sized PRs.
  • If adding new functionality, add an example to docstrings or tutorials.
  • Use underscores (not hyphens) in names of Python files and directories.

Slash Commands

You can write slash commands (/command) in the first line of a comment to perform
specific operations. Supported slash commands are:

  • /format: automatically format and lint the code
  • /test-gmt-dev: run full tests on the latest GMT development version

Install pyarrow as an optional dependency, and check that pandas.Series objects backed by pyarrow dtypes (e.g. 'uint8[pyarrow]') can be read by virtualfile_from_vectors.
@weiji14 weiji14 added the maintenance Boring but important stuff for the core devs label Oct 29, 2023
@weiji14 weiji14 added this to the 0.11.0 milestone Oct 29, 2023
@weiji14 weiji14 self-assigned this Oct 29, 2023
Remove duck-typing that didn't work, and use a proper try-except `import pyarrow`.
@weiji14
Copy link
Member Author

weiji14 commented Nov 7, 2023

  • Check string dtypes
  • Check datetime dtypes

Ok, these two dtypes are not as trivial to support in PyGMT as I thought, because we have a lot of converters that assume numpy-backed arrays. Gonna open a proper issue for this - #2800

@weiji14 weiji14 changed the title WIP: Check compatibility with pyarrow-backed pandas objects WIP: Check compatibility of numeric dtypes with pyarrow-backed pandas objects Nov 7, 2023
Check that pandas.Series and pandas.DataFrame objects backed by pyarrow dtypes (e.g. 'int64[pyarrow]' and 'float64[pyarrow]') can be read by pygmt.info. Using pandas.util._test_decorators_skip_if_no pytest mark to simplify the pytest parametrize.
Geopandas doesn't support casting to pyarrow dtypes like 'int32[pyarrow]' and 'int64[pyarrow]' yet, but adding an xfail test so that we don't forget to test in the future.
Actually, casting to pyarrow integer dtypes work, but writing to the temporary OGR_GMT file doesn't.
pygmt/tests/test_info.py Outdated Show resolved Hide resolved
Comment on lines 145 to 162
pytest.param(
"int32[pyarrow]",
marks=[
td.skip_if_no(package="pyarrow"),
pytest.mark.xfail(
reason="geopandas doesn't support writing columns with pyarrow dtypes to OGR_GMT yet."
),
],
),
pytest.param(
"int64[pyarrow]",
marks=[
td.skip_if_no(package="pyarrow"),
pytest.mark.xfail(
reason="geopandas doesn't support writing columns with pyarrow dtypes to OGR_GMT yet."
),
],
),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So geopandas.GeoDataFrame objects with columns that have pyarrow dtypes can't be plotted in PyGMT yet, because we write them to a temporary OGR_GMT file, and geopandas doesn't support writing int32[pyarrow] or int64[pyarrow] columns. We'll either need to wait for geopandas to support this, or switch PyGMT away from using temporay OGR_GMT files.

Ensure that previous and future versions of GMT are compatible with PyArrow too.
Mention that PyGMT does have some initial support of Pandas objects backed by PyArrow-dtype arrays, but only uint/int/float dtypes for now.
@weiji14 weiji14 marked this pull request as ready for review December 3, 2023 06:36
@weiji14
Copy link
Member Author

weiji14 commented Dec 3, 2023

Ready for review! I've added pyarrow as an 'optional' dependency in ci_tests.yaml, ci_tests_dev.yaml and ci_tests_legacy.yaml, but not ci_docs.yml or ci_doctests.yaml (because there aren't any docs/doctests using it yet). Have added a note under doc/install.rst mentioning pyarrow, but not sure if we want to list pyarrow as a proper optional dependency?

* [ ]  Check null and bool types ?

Encountered a bug with passing pd.NA values (see #2844), even without pyarrow, so not checking this for now.

@weiji14 weiji14 changed the title WIP: Check compatibility of numeric dtypes with pyarrow-backed pandas objects Check compatibility of numeric dtypes with pyarrow-backed pandas objects Dec 3, 2023
@weiji14 weiji14 requested a review from a team December 9, 2023 09:03
@weiji14 weiji14 changed the title Check compatibility of numeric dtypes with pyarrow-backed pandas objects pyarrow: Check compatibility of pyarrow-backed pandas objects with numeric dtypes Dec 9, 2023
@seisman
Copy link
Member

seisman commented Dec 9, 2023

I don't have access to my computer this weekend. will find some time this Sunday night.

pygmt/tests/test_clib_virtualfiles.py Outdated Show resolved Hide resolved
pygmt/tests/test_info.py Outdated Show resolved Hide resolved
Implementing the td.skip_if_no pytest mark inline instead of using the private function from pandas.
doc/install.rst Outdated Show resolved Hide resolved
pygmt/tests/test_clib_virtualfiles.py Outdated Show resolved Hide resolved
pygmt/tests/test_info.py Outdated Show resolved Hide resolved
@weiji14 weiji14 added the needs review This PR has higher priority and needs review. label Dec 16, 2023
doc/install.rst Outdated Show resolved Hide resolved
Co-authored-by: Dongdong Tian <[email protected]>
@weiji14 weiji14 enabled auto-merge (squash) December 16, 2023 06:08
@weiji14 weiji14 removed the needs review This PR has higher priority and needs review. label Dec 16, 2023
@weiji14 weiji14 merged commit 25914d8 into main Dec 16, 2023
17 checks passed
@weiji14 weiji14 deleted the pyarrow-compat branch December 16, 2023 06:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Boring but important stuff for the core devs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants