Ability to add metadata to parquet/orc schemas directly #2901
Comments

@walter9388, contributions are always welcome. We can discuss if …

Marking this issue as stale due to inactivity. This helps our maintainers find and focus on the active issues. If this issue receives no comments in the next 7 days it will automatically be closed.
Is your feature request related to a problem? Please describe.

There is a new requirement at my workplace to have metadata directly in all our cloud data (due to the need to move data between different hosting solutions). This means that our fallback data formats have become Avro/Parquet, as you can attach metadata directly to their schemas. However, there is currently no direct way to do this using the `s3.to_parquet` function, so I wonder if it is possible to add this capability?

Just FYI, I think the `s3.to_parquet` functionality is brilliant and saves so much effort when making Glue tables with partitions etc., so I would really like to be able to carry on using it in our workflows rather than write custom `boto3`/`pyarrow` logic.

Describe the solution you'd like
Extra metadata can be added to the Parquet schema using the `metadata` parameter of `pa.schema` (https://arrow.apache.org/docs/python/generated/pyarrow.schema.html).

Currently, the pyarrow schema is created in the `write` method of `_S3WriteStrategy` via the `_data_types.pyarrow_schema_from_pandas` function.

What I propose is that we add a new `metadata` parameter in the existing `pyarrow_additional_kwargs` dictionary. This avoids any changes to the API, so only a minor version bump would be needed. It would also add the same capability for ORC files via the `pyarrow_additional_kwargs` argument of the `s3.to_orc` function.

From there, the metadata can be extracted and validated in the `_S3WriteStrategy` class (or in the `_S3ParquetWriteStrategy`/`_S3ORCWriteStrategy` child classes separately, if these formats have different metadata constraints; I haven't researched this part yet). We could then pass the metadata to an amended `_data_types.pyarrow_schema_from_pandas` function.

Describe alternatives you've considered
After digging into the code a bit more, I can see that you can attach your own schema directly via `pyarrow_additional_kwargs`, which then overwrites the schema made by awswrangler. However, I would still argue that there is a need for the feature described above: I want `awswrangler` to make the schema for me, and there should be a way to simply pass a dictionary of file metadata to the schema-generator function.

Maybe `pyarrow_additional_kwargs` isn't the best place for it, though, as I can see it is expanded directly into `pyarrow.parquet.ParquetWriter`, so the `metadata` key would have to be popped out of the dictionary before that point.

Let me know your thoughts.
Additional considerations
I know that there are several other functions in this library for handling Parquet/ORC metadata (i.e. `read_parquet_metadata`, `read_orc_metadata`, `store_parquet_metadata`), so we would need to check that these still work correctly. I would have thought it would be fine, though, as they are designed to work with the Parquet/ORC specifications.

I am willing to submit a PR for this feature if approved.