Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

One test fails with pandas 3.0 dev version #3291

Closed
seisman opened this issue Jun 17, 2024 · 3 comments
Closed

One test fails with pandas 3.0 dev version #3291

seisman opened this issue Jun 17, 2024 · 3 comments
Labels
upstream Bug or missing feature of upstream core GMT

Comments

@seisman
Copy link
Member

seisman commented Jun 17, 2024

In the GMT Dev tests workflow, one test fails (see https://github.com/GenericMappingTools/pygmt/actions/runs/9540235948/job/26291686710).

__________________________ test_compute_bins_outfile ___________________________
[gw1] linux -- Python 3.12.3 /home/runner/micromamba/envs/pygmt/bin/python

grid = <xarray.DataArray 'z' (lat: 14, lon: 8)> Size: 448B
array([[347.5, 344.5, 386. , 640.5, 617. , 579. , 646.5, 671. ],
 ...3.5 -22.5 -21.5 -20.5 ... -12.5 -11.5 -10.5
Attributes:
    long_name:     elevation (m)
    actual_range:  [190. 981.]
expected_df =    start   stop  bin_id
0  345.5  519.5       0
1  519.5  726.5       1
region = [-52, -48, -22, -18]

    def test_compute_bins_outfile(grid, expected_df, region):
        """
        Test grdhisteq.compute_bins with ``outfile``.
        """
        with GMTTempFile(suffix=".txt") as tmpfile:
            with pytest.warns(RuntimeWarning) as record:
                result = grdhisteq.compute_bins(
                    grid=grid,
                    divisions=2,
                    region=region,
                    outfile=tmpfile.name,
                )
                assert len(record) == 1  # check that only one warning was raised
            assert result is None  # return value is None
            assert Path(tmpfile.name).stat().st_size > 0
            temp_df = pd.read_csv(
                filepath_or_buffer=tmpfile.name,
                sep="\t",
                header=None,
                names=["start", "stop", "bin_id"],
                dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
                index_col="bin_id",
            )
>           pd.testing.assert_frame_equal(
                left=temp_df, right=expected_df.set_index("bin_id")
            )

../pygmt/tests/test_grdhisteq.py:130: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

left = RangeIndex(start=0, stop=2, step=1, name='bin_id')
right = Index([0, 1], dtype='uint32', name='bin_id'), obj = 'DataFrame.index'

    def _check_types(left, right, obj: str = "Index") -> None:
        if not exact:
            return
    
        assert_class_equal(left, right, exact=exact, obj=obj)
        assert_attr_equal("inferred_type", left, right, obj=obj)
    
        # Skip exact dtype checking when `check_categorical` is False
        if isinstance(left.dtype, CategoricalDtype) and isinstance(
            right.dtype, CategoricalDtype
        ):
            if check_categorical:
                assert_attr_equal("dtype", left, right, obj=obj)
                assert_index_equal(left.categories, right.categories, exact=exact)
            return
    
>       assert_attr_equal("dtype", left, right, obj=obj)
E       AssertionError: DataFrame.index are different
E       
E       Attribute "dtype" are different
E       [left]:  int64
E       [right]: uint32

The fail is likely due to changes/bugs in pandas dev version (3.x). To reproduce the issue:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv("text.dat", sep=r"\s+",
...                 header=None,
...                 names=["start", "stop", "bin_id"],
...                 dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
...                 index_col="bin_id")
>>> df.index.dtype

pandas 2.x returns dtype('uint32') but pandas 3.x returns dtype('int64').

The test data is:

345.5  519.5       0
519.5  726.5       1

Need to read the pandas documentation to understand if it's a desired feature or an upstream bug.

@seisman seisman added the upstream Bug or missing feature of upstream core GMT label Jun 17, 2024
@weiji14 weiji14 changed the title One test fails with pandas dev version One test fails with pandas 3.0 dev version Jun 19, 2024
@weiji14
Copy link
Member

weiji14 commented Jun 19, 2024

Had a look at this locally with pandas=3.0.0.dev0+1125.gc46fb76afa. It seems like pandas is converting the index into a RangeIndex, which has an int64 dtype by default, instead of respecting the uint32 dtype we set. This seems like a regression bug in pandas actually, there are some similar ones reported e.g. at pandas-dev/pandas#9435

For context, we chose to force the uint32 dtype for bin_id at #1433 (comment) (instead of using the int64 default in pandas 1.x/2.x). The reason was because we didn't think anyone would compute more than 2^32 bins with grdhisteq (usually 2^8=256 would be enough), and also there shouldn't be negative numbers in the bin_id column.

So, we could either go with int64 (pandas 3.0 default), or find a way to stick with uint32 (current state). What should we go for?

@seisman
Copy link
Member Author

seisman commented Jun 19, 2024

This seems like a regression bug in pandas actually, there are some similar ones reported e.g. at pandas-dev/pandas#9435

I think it's a pandas bug. For comparison, the following codes return the expected dtype:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.read_csv("text.dat", sep=r"\s+",
   ...:                 header=None,
   ...:                 names=["start", "stop", "bin_id"],
   ...:                 dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
   ...:                 )

In [4]: df2 = df.set_index("bin_id")

In [5]: df2.index.dtype
Out[5]: dtype('uint32')

@seisman
Copy link
Member Author

seisman commented Jun 24, 2024

I've reported the issue to the upstream pandas repository at pandas-dev/pandas#59077. Closing.

@seisman seisman closed this as completed Jun 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
upstream Bug or missing feature of upstream core GMT
Projects
None yet
2 participants