One test fails with pandas 3.0 dev version #3291

seisman · 2024-06-17T08:53:38Z

In the GMT Dev tests workflow, one test fails (see https://github.com/GenericMappingTools/pygmt/actions/runs/9540235948/job/26291686710).

__________________________ test_compute_bins_outfile ___________________________
[gw1] linux -- Python 3.12.3 /home/runner/micromamba/envs/pygmt/bin/python

grid = <xarray.DataArray 'z' (lat: 14, lon: 8)> Size: 448B
array([[347.5, 344.5, 386. , 640.5, 617. , 579. , 646.5, 671. ],
 ...3.5 -22.5 -21.5 -20.5 ... -12.5 -11.5 -10.5
Attributes:
    long_name:     elevation (m)
    actual_range:  [190. 981.]
expected_df =    start   stop  bin_id
0  345.5  519.5       0
1  519.5  726.5       1
region = [-52, -48, -22, -18]

    def test_compute_bins_outfile(grid, expected_df, region):
        """
        Test grdhisteq.compute_bins with ``outfile``.
        """
        with GMTTempFile(suffix=".txt") as tmpfile:
            with pytest.warns(RuntimeWarning) as record:
                result = grdhisteq.compute_bins(
                    grid=grid,
                    divisions=2,
                    region=region,
                    outfile=tmpfile.name,
                )
                assert len(record) == 1  # check that only one warning was raised
            assert result is None  # return value is None
            assert Path(tmpfile.name).stat().st_size > 0
            temp_df = pd.read_csv(
                filepath_or_buffer=tmpfile.name,
                sep="\t",
                header=None,
                names=["start", "stop", "bin_id"],
                dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
                index_col="bin_id",
            )
>           pd.testing.assert_frame_equal(
                left=temp_df, right=expected_df.set_index("bin_id")
            )

../pygmt/tests/test_grdhisteq.py:130: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

left = RangeIndex(start=0, stop=2, step=1, name='bin_id')
right = Index([0, 1], dtype='uint32', name='bin_id'), obj = 'DataFrame.index'

    def _check_types(left, right, obj: str = "Index") -> None:
        if not exact:
            return
    
        assert_class_equal(left, right, exact=exact, obj=obj)
        assert_attr_equal("inferred_type", left, right, obj=obj)
    
        # Skip exact dtype checking when `check_categorical` is False
        if isinstance(left.dtype, CategoricalDtype) and isinstance(
            right.dtype, CategoricalDtype
        ):
            if check_categorical:
                assert_attr_equal("dtype", left, right, obj=obj)
                assert_index_equal(left.categories, right.categories, exact=exact)
            return
    
>       assert_attr_equal("dtype", left, right, obj=obj)
E       AssertionError: DataFrame.index are different
E       
E       Attribute "dtype" are different
E       [left]:  int64
E       [right]: uint32

The failure is likely due to changes or bugs in the pandas dev version (3.x). To reproduce the issue:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.read_csv("text.dat", sep=r"\s+",
...                 header=None,
...                 names=["start", "stop", "bin_id"],
...                 dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
...                 index_col="bin_id")
>>> df.index.dtype

pandas 2.x returns dtype('uint32'), but the pandas 3.0 dev version returns dtype('int64').

The test data is:

345.5  519.5       0
519.5  726.5       1
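
For convenience, here is a self-contained variant of the reproducer above. It is only a sketch: it assumes the two data rows can be fed through io.StringIO instead of the text.dat file, and it prints the pandas version next to the index class and dtype so results from different installs can be compared.

```python
import io

import numpy as np
import pandas as pd

# The two rows of test data from this issue, held in memory
# (assumption: equivalent to reading them from text.dat).
DATA = "345.5  519.5       0\n519.5  726.5       1\n"

df = pd.read_csv(
    io.StringIO(DATA),
    sep=r"\s+",
    header=None,
    names=["start", "stop", "bin_id"],
    dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
    index_col="bin_id",
)

# Report what this pandas install produced for the index.
print(pd.__version__, type(df.index).__name__, df.index.dtype)
```

No expected output is listed here because the index dtype is exactly what differs between pandas versions.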

We need to check the pandas documentation to determine whether this is an intended change or an upstream bug.


weiji14 · 2024-06-19T07:33:48Z

Had a look at this locally with pandas=3.0.0.dev0+1125.gc46fb76afa. It seems like pandas is converting the index into a RangeIndex, which has an int64 dtype by default, instead of respecting the uint32 dtype we set. This seems like a regression bug in pandas actually, there are some similar ones reported e.g. at pandas-dev/pandas#9435
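
To illustrate the mismatch independently of read_csv, the sketch below builds the two indexes from the traceback by hand and compares them. It assumes pandas 2.0 or later, where a plain Index can hold a uint32 dtype; the variable names are only for illustration.

```python
import pandas as pd

# The index pandas 3.0 dev produced (left side of the failing assertion).
range_index = pd.RangeIndex(start=0, stop=2, step=1, name="bin_id")
# The index the test expected (right side of the failing assertion).
uint_index = pd.Index([0, 1], dtype="uint32", name="bin_id")

# Same values, but the dtype check inside assert_index_equal rejects the pair.
try:
    pd.testing.assert_index_equal(range_index, uint_index)
    dtypes_match = True
except AssertionError:
    dtypes_match = False

print(range_index.dtype, uint_index.dtype, dtypes_match)
```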

For context, we chose to force the uint32 dtype for bin_id at #1433 (comment) (instead of using the int64 default in pandas 1.x/2.x). The reason was that we didn't expect anyone to compute more than 2^32 bins with grdhisteq (usually 2^8=256 would be enough), and the bin_id column should never contain negative numbers.
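
As a quick check of that reasoning, the limits of uint32 can be read from NumPy. This is just a sketch of the arithmetic, not part of PyGMT:

```python
import numpy as np

# Representable range of the dtype chosen for bin_id.
info = np.iinfo(np.uint32)

# The minimum is 0 (no negative ids) and the maximum is 2**32 - 1,
# so ids 0 .. 2**32 - 1 cover 2**32 distinct bins.
print(info.min, info.max)
```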

So, we could either go with int64 (pandas 3.0 default), or find a way to stick with uint32 (current state). What should we go for?

seisman · 2024-06-19T07:44:37Z

> This seems like a regression bug in pandas actually, there are some similar ones reported e.g. at pandas-dev/pandas#9435

I think it's a pandas bug. For comparison, the following code returns the expected dtype:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: df = pd.read_csv("text.dat", sep=r"\s+",
   ...:                 header=None,
   ...:                 names=["start", "stop", "bin_id"],
   ...:                 dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
   ...:                 )

In [4]: df2 = df.set_index("bin_id")

In [5]: df2.index.dtype
Out[5]: dtype('uint32')
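
The same two-step approach (read without index_col, then call set_index) can be wrapped in a small function. This is a minimal sketch only: read_bins is a hypothetical helper name, the data is fed through io.StringIO rather than text.dat, and it is not necessarily the change that ended up in PyGMT.

```python
import io

import numpy as np
import pandas as pd


def read_bins(source):
    """Hypothetical helper: read the three-column bin table, then set the index.

    Setting the index in a separate step keeps the uint32 dtype requested
    for bin_id, instead of relying on read_csv's index_col handling.
    """
    df = pd.read_csv(
        source,
        sep=r"\s+",
        header=None,
        names=["start", "stop", "bin_id"],
        dtype={"start": np.float32, "stop": np.float32, "bin_id": np.uint32},
    )
    return df.set_index("bin_id")


bins = read_bins(io.StringIO("345.5  519.5       0\n519.5  726.5       1\n"))
print(bins.index.dtype)
```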

seisman · 2024-06-24T01:13:16Z

I've reported the issue to the upstream pandas repository at pandas-dev/pandas#59077. Closing.

seisman added the "upstream" label (Bug or missing feature of upstream core GMT) Jun 17, 2024

weiji14 changed the title ~~One test fails with pandas dev version~~ One test fails with pandas 3.0 dev version Jun 19, 2024

weiji14 mentioned this issue Jun 19, 2024

RFC: grdhisteq.compute_bins: use int64 dtype for bin_id column instead of uint32 #3294

Closed

7 tasks

seisman mentioned this issue Jun 19, 2024

Workaround for the pd.read_csv's index_col bug in pandas 3.0 dev version #3295

Merged

seisman closed this as completed Jun 24, 2024
