Ability to add metadata to parquet/orc schemas directly #2901
Comments

@walter9388, contributions are always welcome. We can discuss if …

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.
Is your feature request related to a problem? Please describe.

There is a new requirement at my workplace to have metadata directly in all our cloud data (due to the need to move data between different hosting solutions). This means that our fallback data formats have become Avro/Parquet, as you can attach metadata directly to their schemas. However, there is currently no direct way to do this using the `s3.to_parquet` function, so I wonder if it is possible to add this capability?

Just FYI, I think the `s3.to_parquet` functionality is brilliant and saves so much effort when making Glue tables with partitions etc., so I would really like to be able to carry on using it in our workflows rather than write custom `boto3`/`pyarrow` logic.

Describe the solution you'd like
Extra metadata can be added to the Parquet schema using the `metadata` parameter of `pa.schema` (https://arrow.apache.org/docs/python/generated/pyarrow.schema.html).

Currently, the pyarrow schema is created in the `write` method of `_S3WriteStrategy` via the `_data_types.pyarrow_schema_from_pandas` function.

What I propose is that we add a new `metadata` parameter in the existing `pyarrow_additional_kwargs` dictionary. This avoids any changes to the API, so only a minor version bump would be needed. It would also add the same capability for ORC files via the `pyarrow_additional_kwargs` argument of the `s3.to_orc` function.

From there, the metadata can be extracted and validated in the `_S3WriteStrategy` class (or in the `_S3ParquetWriteStrategy`/`_S3ORCWriteStrategy` child classes separately, if these formats have different metadata constraints; I haven't researched this part yet). We could then pass the metadata to an amended `_data_types.pyarrow_schema_from_pandas` function.

Describe alternatives you've considered
After digging into the code a bit more, I can see that you can attach your own schema directly via `pyarrow_additional_kwargs`, which then overwrites the schema made by awswrangler. However, I would still argue that there is a need for the feature described above: I want `awswrangler` to make the schema for me, and there should be a way to simply pass a dictionary of file metadata to the schema-generator function.

Maybe `pyarrow_additional_kwargs` isn't the best place for it, though, as I can see it is expanded directly into `pyarrow.parquet.ParquetWriter`, so the `metadata` key would have to be popped out of the dictionary before that point.

Let me know your thoughts.
Additional considerations
I know that there are several other functions in this library for handling Parquet/ORC metadata (i.e. `read_parquet_metadata`, `read_orc_metadata`, `store_parquet_metadata`), so we would need to check that these still work correctly. I would have thought it would be fine, though, as they are designed to work with the Parquet/ORC specifications.

I am willing to submit a PR for this feature if approved.