
GH-33321: [Python] Support converting to non-nano datetime64 for pandas >= 2.0 #35656

Merged

Conversation

@danepitkin (Member) commented May 17, 2023

Do not coerce temporal types to nanoseconds when pandas >= 2.0 is imported, since pandas now supports s/ms/us time units.

This PR adds support for the following Arrow -> Pandas conversions, which previously all defaulted to datetime64[ns] or datetime64[ns, <TZ>] (a short sketch follows the list):

date32 -> datetime64[ms]
date64 -> datetime64[ms]
datetime64[s] -> datetime64[s]
datetime64[ms] -> datetime64[ms]
datetime64[us] -> datetime64[us]
datetime64[s, <TZ>] -> datetime64[s, <TZ>]
datetime64[ms, <TZ>] -> datetime64[ms, <TZ>]
datetime64[us, <TZ>] -> datetime64[us, <TZ>]
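
A minimal sketch of the new defaults, assuming pandas >= 2.0 and a pyarrow build that includes this change:

    import pyarrow as pa

    table = pa.table({
        "ts_s": pa.array([1, 2, 3], pa.timestamp("s")),
        "ts_ms": pa.array([1, 2, 3], pa.timestamp("ms")),
        "date": pa.array([0, 1, 2], pa.date32()),
    })

    df = table.to_pandas()
    # With pandas >= 2.0 the original resolutions are kept:
    #   ts_s -> datetime64[s], ts_ms -> datetime64[ms], date -> datetime64[ms]
    # With pandas < 2.0 everything is still coerced to datetime64[ns].
    print(df.dtypes)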

Rationale for this change

Pandas 2.0 introduces proper support for non-nanosecond (s/ms/us) temporal types.

Are these changes tested?

Yes. Pytest cases were added and updated.

Are there any user-facing changes?

Yes, arrow-to-pandas default conversion behavior will change when users have pandas >= 2.0, but a legacy option is exposed to provide backwards compatibility (sketched below).

This PR includes breaking changes to public APIs.
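
A minimal sketch of the legacy switch mentioned above, using the coerce_temporal_nanoseconds option discussed in the review below (assuming pandas >= 2.0):

    import pyarrow as pa

    table = pa.table({"ts": pa.array([1, 2, 3], pa.timestamp("us"))})

    # New default with pandas >= 2.0: the microsecond resolution is preserved.
    df_new = table.to_pandas()                                     # ts: datetime64[us]

    # Legacy behaviour: force nanosecond resolution, as before pandas 2.0.
    df_legacy = table.to_pandas(coerce_temporal_nanoseconds=True)  # ts: datetime64[ns]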

@danepitkin danepitkin requested a review from AlenkaF as a code owner May 17, 2023 22:33
@danepitkin (Member, Author):

I'm looking for early feedback to see if this is the right approach. There are many test cases that will need updating, but I didn't want to tackle them yet in case we take a different approach.

@jorisvandenbossche (Member) left a comment:

Yes, that approach looks good, and is actually simpler than I thought it would be since we already control this with the single option switch (for the code, the tests will indeed get a bit messier).

I think one question is whether we want to make that option public through the to_pandas API, so people could still override it to get nanoseconds if they want (to get back the pre-pandas-2.0 behaviour).

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels May 23, 2023
@danepitkin danepitkin marked this pull request as draft May 23, 2023 22:47
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels May 24, 2023
@danepitkin (Member, Author) commented May 24, 2023

> Yes, that approach looks good, and is actually simpler than I thought it would be since we already control this with the single option switch (for the code, the tests will indeed get a bit messier).
>
> I think one question is whether we want to make that option public through the to_pandas API, so people could still override it to get nanoseconds if they want (to get back the pre-pandas-2.0 behaviour).

I'll expose this! I agree it's best to allow continued use of the legacy behavior for a while.

_Type_DATE64: np.dtype('datetime64[ns]'),
_Type_TIMESTAMP: np.dtype('datetime64[ns]'),
_Type_DURATION: np.dtype('timedelta64[ns]'),
_Type_DATE32: np.dtype('datetime64[D]'),
Member:

pandas only supports the range of seconds to nanoseconds, so for dates we should maybe still default to datetime64[s]? (otherwise I assume this conversion would happen anyway on the pandas side)

Member (Author):

Thank you! NumPy supports the [D]ay unit, but pandas does not.

Member (Author):

One thing I found is that Parquet only supports [ms], [us], and [ns]. So now several pyarrow dataset tests are failing because datasets with [D]ay units are being converted to [ms] units. I'm somewhat inclined to convert date32 to [ms] by default so we don't have to add a conversion from [ms] -> [s] when doing a parquet roundtrip. Or we just let it happen and modify the tests.

Member (Author):

This wasn't a problem before when everything was coerced to [ns], which parquet supports.

Member:

> I'm somewhat inclined to convert date32 to [ms] by default so we don't have to add a conversion from [ms] -> [s] when doing a parquet roundtrip

Yes, that sounds like a good idea (then it also gives the same result for date32 vs date64).

Member:

> several pyarrow dataset tests are failing because datasets with [D]ay units are being converted to [ms] units

Can you point to which test is failing? Because this is about conversion from pyarrow to pandas, right? (not arrow<->parquet roundtrip, which should be able to preserve our date32 type because we store the arrow schema)

Member (Author):

The current tests manipulate the types so the test cases pass, but these were the tests that originally were failing (the tests in their current state are a bit of a mess right now; I need to go back and clean them up once the implementation actually appears to work properly):

FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_index_name[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_index_name[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_no_partitions[True] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_no_partitions[False] - AssertionError: Attributes of DataFrame.iloc[:, 3] (column name="date") are different

Member (Author):

    @pytest.mark.filterwarnings("ignore:'ParquetDataset.schema:FutureWarning")
    def _test_write_to_dataset_with_partitions(base_path,
                                               use_legacy_dataset=True,
                                               filesystem=None,
                                               schema=None,
                                               index_name=None):
        import pandas as pd
        import pandas.testing as tm

        import pyarrow.parquet as pq

        # ARROW-1400
        output_df = pd.DataFrame({'group1': list('aaabbbbccc'),
                                  'group2': list('eefeffgeee'),
                                  'num': list(range(10)),
                                  'nan': [np.nan] * 10,
                                  'date': np.arange('2017-01-01', '2017-01-11',
                                                    dtype='datetime64[D]')})
        output_df["date"] = output_df["date"]
        cols = output_df.columns.tolist()
        partition_by = ['group1', 'group2']
        output_table = pa.Table.from_pandas(output_df, schema=schema, safe=False,
                                            preserve_index=False)
        pq.write_to_dataset(output_table, base_path, partition_by,
                            filesystem=filesystem,
                            use_legacy_dataset=use_legacy_dataset)

        metadata_path = os.path.join(str(base_path), '_common_metadata')

        if filesystem is not None:
            with filesystem.open(metadata_path, 'wb') as f:
                pq.write_metadata(output_table.schema, f)
        else:
            pq.write_metadata(output_table.schema, metadata_path)

        # ARROW-2891: Ensure the output_schema is preserved when writing a
        # partitioned dataset
        dataset = pq.ParquetDataset(base_path,
                                    filesystem=filesystem,
                                    validate_schema=True,
                                    use_legacy_dataset=use_legacy_dataset)
        # ARROW-2209: Ensure the dataset schema also includes the partition columns
        if use_legacy_dataset:
            with pytest.warns(FutureWarning, match="'ParquetDataset.schema'"):
                dataset_cols = set(dataset.schema.to_arrow_schema().names)
        else:
            # NB schema property is an arrow and not parquet schema
            dataset_cols = set(dataset.schema.names)

        assert dataset_cols == set(output_table.schema.names)

        input_table = dataset.read(use_pandas_metadata=True)

        input_df = input_table.to_pandas()

        # Read data back in and compare with original DataFrame
        # Partitioned columns added to the end of the DataFrame when read
        input_df_cols = input_df.columns.tolist()
        assert partition_by == input_df_cols[-1 * len(partition_by):]

        input_df = input_df[cols]
        # Partitioned columns become 'categorical' dtypes
        for col in partition_by:
            output_df[col] = output_df[col].astype('category')
        # if schema is None and Version(pd.__version__) >= Version("2.0.0"):
        #     output_df['date'] = output_df['date'].astype('datetime64[ms]')
>       tm.assert_frame_equal(output_df, input_df)
E       AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
E
E       Attribute "dtype" are different
E       [left]:  datetime64[s]
E       [right]: datetime64[ms]

Member:

I think those test failures are related to the fact that, with our defaults, parquet doesn't support nanoseconds, and we actually don't try to preserve the unit when roundtripping from arrow<->parquet:

In [1]: table = pa.table({"col": pa.array([1, 2, 3], pa.timestamp("s")).cast(pa.timestamp("ns"))})

In [2]: import pyarrow.parquet as pq

In [3]: pq.write_table(table, "test_nanoseconds.parquet")

In [4]: pq.read_table("test_nanoseconds.parquet")
Out[4]: 
pyarrow.Table
col: timestamp[us]
----
col: [[1970-01-01 00:00:01.000000,1970-01-01 00:00:02.000000,1970-01-01 00:00:03.000000]]

So starting with an arrow table with nanoseconds, the result has microseconds (even though we actually could preserve the original unit, because we store the original arrow schema in the parquet metadata. That would not be a zero-copy restoration, though, in contrast to, for example, restoring the timezone, or restoring duration from int64, which is done in ApplyOriginalStorageMetadata).

So this means that whenever we start with nanoseconds, we get back microseconds after a roundtrip to parquet. And if the roundtrip actually started from pandas using nanoseconds, we now also get microseconds in the pandas result (while before we still got nanoseconds, since we forced using that in the arrow->pandas conversion step).
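
A sketch of the writer-version knob involved here, assuming pq.write_table accepts version="2.6" (the Parquet format version that adds a nanosecond timestamp type):

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"col": pa.array([1, 2, 3], pa.timestamp("ns"))})

    # Depending on the default writer version, this may coerce ns to us (as in the snippet above).
    pq.write_table(table, "default_version.parquet")
    print(pq.read_table("default_version.parquet").schema)

    # Parquet format 2.6 supports nanoseconds, so the unit survives the roundtrip.
    pq.write_table(table, "v26.parquet", version="2.6")
    print(pq.read_table("v26.parquet").schema)   # col: timestamp[ns]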

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels May 25, 2023
@danepitkin danepitkin force-pushed the danepitkin/arrow-33321-pd-ns branch from bf472be to f0f8cba Compare June 5, 2023 19:48
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 5, 2023
@danepitkin danepitkin force-pushed the danepitkin/arrow-33321-pd-ns branch from 37e120c to 0fb1c51 Compare June 6, 2023 22:51
@danepitkin (Member, Author) commented Jun 6, 2023

Current update: the tests failing locally for me are 1) parquet dataset roundtrips where date32 days are converted to milliseconds instead of seconds, because seconds are not supported in parquet, and 2) all TZ-aware timestamps are defaulted to nanoseconds (i.e. I need to add support for other time units in C++).

For (1), I mentioned in another comment that we can convert date32 to milliseconds instead of seconds. For (2), I just need to add support, but it's going to grow this PR even larger, unfortunately.

Edit: For (1), I think it's actually fine to keep date32 as [s]econds. It's a known limitation that parquet does not support this unit type.

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jun 8, 2023
@jorisvandenbossche (Member):

> For (2), I just need to add support, but it's going to grow this PR even larger, unfortunately.

If PR size is a concern, this is also something that could be done as a precursor. It's actually already an issue that shows in conversion to numpy as well:

# no timezone -> this preserves the unit
>>> pa.array([1, 2, 3], pa.timestamp('us')).to_numpy()
array(['1970-01-01T00:00:00.000001', '1970-01-01T00:00:00.000002',
       '1970-01-01T00:00:00.000003'], dtype='datetime64[us]')

# with timezone -> always converts to nanoseconds
>>> pa.array([1, 2, 3], pa.timestamp('us', tz="Europe/Brussels")).to_numpy()
...
ArrowInvalid: Needed to copy 1 chunks with 0 nulls, but zero_copy_only was True

>>> pa.array([1, 2, 3], pa.timestamp('us', tz="Europe/Brussels")).to_numpy(zero_copy_only=False)
array(['1970-01-01T00:00:00.000001000', '1970-01-01T00:00:00.000002000',
       '1970-01-01T00:00:00.000003000'], dtype='datetime64[ns]')

While this could also be perfectly zero-copy to microseconds in the case with a timezone (we just return the underlying UTC values anyway)

@jorisvandenbossche jorisvandenbossche added the Breaking Change Includes a breaking change to the API label Jun 8, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jun 8, 2023
@jorisvandenbossche (Member):

For the tz aware update, that also influences the (currently untested) to_numpy behaviour, which you can test with the following change:

--- a/python/pyarrow/tests/test_array.py
+++ b/python/pyarrow/tests/test_array.py
@@ -211,9 +211,10 @@ def test_to_numpy_writable():
         arr.to_numpy(zero_copy_only=True, writable=True)
 
 
+@pytest.mark.parametrize('tz', [None, "UTC"])
 @pytest.mark.parametrize('unit', ['s', 'ms', 'us', 'ns'])
-def test_to_numpy_datetime64(unit):
-    arr = pa.array([1, 2, 3], pa.timestamp(unit))
+def test_to_numpy_datetime64(unit, tz):
+    arr = pa.array([1, 2, 3], pa.timestamp(unit, tz=tz))
     expected = np.array([1, 2, 3], dtype="datetime64[{}]".format(unit))
     np_arr = arr.to_numpy()
     np.testing.assert_array_equal(np_arr, expected)

@danepitkin (Member, Author):

> For the tz aware update, that also influences the (currently untested) to_numpy behaviour, which you can test with the following change: [...]

Thank you! Updated and it passes out of the gate 🎉

int64_t* out_values = this->GetBlockColumnStart(rel_placement);
if (type == Type::DATE32) {
// Convert from days since epoch to datetime64[ms]
ConvertDatetimeLikeNanos<int32_t, 86400000L>(*data, out_values);
Member:

I am wondering whether, if we do such a naive multiplication here, we can get overflow errors for out-of-bounds timestamps (but this is already the case for the current code converting to nanoseconds as well, to be clear).

Member (Author):

Technically, I think this specific multiplication is fine. INT32_MAX * 86400000 = 1.8554259e+17, while INT64_MAX = 9.223372e+18.
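
A quick sanity check of that arithmetic, sketched in Python rather than the C++ under discussion:

    import numpy as np

    INT32_MAX = int(np.iinfo(np.int32).max)   # 2_147_483_647
    INT64_MAX = int(np.iinfo(np.int64).max)   # 9_223_372_036_854_775_807

    MS_PER_DAY = 86_400_000
    NS_PER_DAY = 86_400_000_000_000

    # The largest date32 value converted to milliseconds still fits in int64 ...
    assert INT32_MAX * MS_PER_DAY < INT64_MAX   # ~1.855e17 < ~9.223e18
    # ... but converting to nanoseconds can overflow int64 (the case tracked in GH-36084).
    assert INT32_MAX * NS_PER_DAY > INT64_MAX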

Member:

Ah, yes, when milliseconds is the target this is probably fine. For the nanoseconds case, though, this gives wrong results. Opened #36084 about that.

@@ -176,8 +176,8 @@ def alltypes_sample(size=10000, seed=0, categorical=False):
         # TODO(wesm): Test other timestamp resolutions now that arrow supports
         # them
         'datetime': np.arange("2016-01-01T00:00:00.001", size,
-                              dtype='datetime64[ms]').astype('datetime64[ns]'),
+                              dtype='datetime64[ms]'),
         'timedelta': np.arange(0, size, dtype="timedelta64[ns]"),
Member:

We can maybe keep both original ns and new ms resolution? (to test both)

Member (Author):

Quickly enabling it does open up a small can of worms:

FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[None-True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[None-False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[1000-True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_data_types.py::test_parquet_2_0_roundtrip[1000-False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[True] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_with_partitions_and_schema[False] - AssertionError: Attributes of DataFrame.iloc[:, 4] (column name="date") are different
FAILED pyarrow/tests/parquet/test_metadata.py::test_parquet_metadata_api - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_metadata.py::test_compare_schemas - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_custom_metadata - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_column_multiindex[True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_column_multiindex[False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_2_0_roundtrip_read_pandas_no_index_written[True] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_pandas.py::test_pandas_parquet_2_0_roundtrip_read_pandas_no_index_written[False] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[300] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[1000] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_columns_reader[1300] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001
FAILED pyarrow/tests/parquet/test_parquet_file.py::test_iter_batches_reader[1000] - pyarrow.lib.ArrowInvalid: Casting from timestamp[ns] to timestamp[ms] would lose data: 1451606400001000001

Maybe better for a follow-up PR?

Member (Author):

Ok, I am adding and fixing the tests. Just needed to remove coercion to ms in the test cases.

Member (Author):

Good suggestion!

@@ -1202,7 +1211,7 @@ def test_table_convert_date_as_object(self):
         df_datetime = table.to_pandas(date_as_object=False)
         df_object = table.to_pandas()

-        tm.assert_frame_equal(df.astype('datetime64[ns]'), df_datetime,
+        tm.assert_frame_equal(df.astype('datetime64[ms]'), df_datetime,
Member:

Do we have coverage for testing that it stays nanoseconds if you specify coerce_temporal_nanoseconds=True?

Member (Author):

Not yet, will add!

@danepitkin danepitkin force-pushed the danepitkin/arrow-33321-pd-ns branch from a65301e to 6ffb5e5 Compare July 6, 2023 14:50
@danepitkin (Member, Author):

@github-actions crossbow submit -g integration

@github-actions (bot) commented Jul 6, 2023

Revision: 6ffb5e5

Submitted crossbow builds: ursacomputing/crossbow @ actions-40508e3899

Task Status
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-master Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.1.2 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-conda-python-3.9-spark-v3.2.0 Github Actions

Comment on lines 1319 to 1321
# Arrow to Pandas v2 will convert date32 to [ms]. Pandas v1 will always
# silently coerce to [ns] due to non-[ns] support.
expected_date_type = 'datetime64[ms]'
Member:

This comment is not fully correct, I think (when converting the pandas dataframe to pyarrow, we actually don't have date32, but a timestamp type). But then I also don't understand how this test is passing...

Member:

So what actually happens with pandas 2.x: when we create a DataFrame with datetime64[D], that gets converted to datetime64[s] (closest supported resolution to "D"). Then roundtripping to parquet turns that into "ms" (because "s" is not supported by Parquet)

With older pandas this gets converted to datetime64[ns], will come back from Parquet as "us", and converted back to "ns" when converting to pandas. But this astype("datetime64[ms]") essentially doesn't do anything, i.e. pandas does preserve the "ns" because it doesn't support "ms", and hence the test also passes for older pandas.

Maybe it's simpler to just test with a DataFrame of nanoseconds, which now works the same with old and new pandas, and then we don't have to add any comment or astype.
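
For illustration, a sketch of the first step described above (how a datetime64[D] column is stored), assuming pandas >= 2.0:

    import numpy as np
    import pandas as pd

    days = np.arange("2017-01-01", "2017-01-04", dtype="datetime64[D]")
    df = pd.DataFrame({"date": days})

    # pandas 2.x keeps the closest supported resolution ("s"); pandas 1.x coerces to "ns".
    print(df["date"].dtype)   # datetime64[s] with pandas >= 2.0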

Member:

> Maybe it's simpler to just test with a DataFrame of nanoseconds, which now works the same with old and new pandas, and then we don't have to add any comment or astype.

Hmm, trying that out locally fails (but only with the non-legacy code path), and digging in, it seems that we are still writing Parquet v1 files with the dataset API...
Will open a separate issue and PR to quickly fix that separately.

Member:

-> #36538

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting change review Awaiting change review labels Jul 7, 2023
@github-actions github-actions bot added awaiting change review Awaiting change review and removed awaiting changes Awaiting changes labels Jul 7, 2023
@jorisvandenbossche (Member) left a comment:

I pushed a small clean-up on top of my PR to fix the Parquet version for the dataset writer. Further checked all changes to the parquet tests as well, and all looks good!

@jorisvandenbossche (Member):

@github-actions crossbow submit -g integration

@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting change review Awaiting change review labels Jul 7, 2023
@jorisvandenbossche (Member):

And with this PR and the Parquet v2.6 update combined, the failures in the dask builds are now much smaller (just one failure that was testing that a timestamp would overflow by being cast to nanoseconds).

@github-actions (bot) commented Jul 7, 2023

Revision: a6487c2

Submitted crossbow builds: ursacomputing/crossbow @ actions-5c8f3f6cad

Task Status
test-conda-python-3.10-hdfs-2.9.2 Github Actions
test-conda-python-3.10-hdfs-3.2.1 Github Actions
test-conda-python-3.10-pandas-latest Github Actions
test-conda-python-3.10-pandas-nightly Github Actions
test-conda-python-3.10-spark-master Github Actions
test-conda-python-3.11-dask-latest Github Actions
test-conda-python-3.11-dask-upstream_devel Github Actions
test-conda-python-3.11-pandas-upstream_devel Github Actions
test-conda-python-3.8-pandas-1.0 Github Actions
test-conda-python-3.8-spark-v3.1.2 Github Actions
test-conda-python-3.9-pandas-latest Github Actions
test-conda-python-3.9-spark-v3.2.0 Github Actions

@jorisvandenbossche jorisvandenbossche marked this pull request as ready for review July 7, 2023 14:44
@jorisvandenbossche jorisvandenbossche merged commit 4f56aba into apache:main Jul 7, 2023
13 checks passed
@jorisvandenbossche jorisvandenbossche removed the awaiting merge Awaiting merge label Jul 7, 2023
@jorisvandenbossche (Member):

Thanks @danepitkin!

@danepitkin (Member, Author):

Thanks @jorisvandenbossche for the collaboration and support!

Comment on lines +4262 to +4263
[pd.period_range("2012-01-01", periods=3, freq="D").array,
pd.interval_range(1, 4).array])
Member:

@jorisvandenbossche @danepitkin We can't use pd here because pandas may not be available.
This causes an error in a "no pandas" environment: https://github.com/apache/arrow/actions/runs/5496447565/jobs/10016477233
This PR's CI succeeded because our "Without Pandas" job installed pandas implicitly. That has been fixed by #36542.

Could you open an issue for this and fix it?

Member:

Fixed -> #36586

@github-actions github-actions bot added the awaiting changes Awaiting changes label Jul 9, 2023
@conbench-apache-arrow (bot):

Conbench analyzed the 6 benchmark runs on commit 4f56aba3.

There were 7 benchmark results indicating a performance regression:

The full Conbench report has more details.

Labels
awaiting changes, Breaking Change (includes a breaking change to the API), Component: Python
Development

Successfully merging this pull request may close these issues.

[Python] Support converting to non-nano datetime64 for pandas >= 2.0
3 participants