Allow more than parquet files in redshift copy from files function #2839

Closed
m1hawkgsm opened this issue May 28, 2024 · 3 comments · Fixed by #2849

@m1hawkgsm

Is your feature request related to a problem? Please describe.
The awswrangler.redshift.copy_from_files function is quite powerful, as it abstracts away the need to create temporary tables for upserts, as well as other details that are tedious when using the classic redshift_connector library directly. However, it only supports Parquet files.

Would it be possible to allow other file formats such as CSV? For my particular use case, I am exporting bulk data from an Aurora Postgres instance to S3 and loading it into Redshift. Ideally Postgres could export to Parquet, but the aws_s3.query_export_to_s3() function only supports text, binary, or CSV output, not Parquet.
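
For context, this is roughly how the data lands in S3 as CSV on the export side. A minimal sketch, assuming a psycopg2 connection to the Aurora instance; the host, bucket, and table names are placeholders:

import psycopg2

# Placeholder connection details for the Aurora PostgreSQL cluster.
conn = psycopg2.connect(
    host="my-aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com",
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

with conn, conn.cursor() as cur:
    # aws_s3.query_export_to_s3 writes the query result straight to S3;
    # it only supports text, CSV, or binary output, hence the CSV files.
    cur.execute("""
        SELECT * FROM aws_s3.query_export_to_s3(
            'SELECT * FROM my_schema.my_table',
            aws_commons.create_s3_uri('my-bucket', 'exports/my_table.csv', 'us-east-1'),
            options := 'format csv'
        );
    """)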

Obviously I could leverage another tool such as Glue / Spark, but that defeats the utility of this particular redshift submodule method.

Describe the solution you'd like
Can the redshift.copy_from_files method be adjusted to allow passing a format argument?
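
Something along these lines, purely to illustrate the requested interface. The format argument below is hypothetical (it does not exist today), and the connection name, path, and IAM role ARN are placeholders:

import awswrangler as wr

con = wr.redshift.connect("my-glue-connection")

wr.redshift.copy_from_files(
    path="s3://my-bucket/exports/",
    con=con,
    table="my_table",
    schema="public",
    iam_role="arn:aws:iam::123456789012:role/my-redshift-copy-role",
    format="csv",  # hypothetical new argument
)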

Describe alternatives you've considered
Not using this method and implementing the logic myself (which defeats the purpose of using awswrangler here).

@LeonLuttenberger
Contributor

Hey,

One of the issues with formats like CSV is that, unlike Parquet or ORC, they don't store metadata on things such as column types. In order to infer types and transform them to the corresponding Redshift types, we need to load the whole table.

As such, if redshift.copy_from_files supported CSV files, it would be equivalent to just loading the CSV data using s3.read_csv and then invoking redshift.copy with the DataFrame. This also presents a simple workaround for your issue:

import awswrangler as wr

df = wr.s3.read_csv("s3://...", ...)
wr.redshift.copy(df=df, path=temp_path, table=table_name)

Let me know if this helps,
Leon

@m1hawkgsm
Author

Yeah, I was thinking about that, and it makes sense. On the other hand, the Postgres unload (or other operations that yield CSV data, for that matter) often produces files that are very large (> 20 GB), which makes loading them locally infeasible in some cases and inefficient in others (after all, the whole point is large, parallel bulk operations, right?).

Would it make sense to allow CSVs as long as you pass in the schema manually? The benefit here is being able to reuse how the package does merges/upserts behind the scenes (which I otherwise end up implementing on my own).
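
For example, something like the following. The format and column_types arguments are hypothetical, just to illustrate passing the schema explicitly so that no local type inference (and no full download of the CSV files) is needed; connection name, path, and role ARN are placeholders:

import awswrangler as wr

con = wr.redshift.connect("my-glue-connection")

# With the column types supplied up front, the existing temp-table plus
# merge/upsert machinery could run without reading the CSV data locally.
wr.redshift.copy_from_files(
    path="s3://my-bucket/exports/",
    con=con,
    table="my_table",
    schema="public",
    iam_role="arn:aws:iam::123456789012:role/my-redshift-copy-role",
    mode="upsert",
    primary_keys=["id"],
    format="csv",                                           # hypothetical
    column_types={"id": "BIGINT", "name": "VARCHAR(256)"},  # hypothetical
)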

@pvieito

pvieito commented Jun 5, 2024

@LeonLuttenberger the option to support CSV would also be very useful for uploading DataFrames with GEOMETRY columns, as ingesting them via Parquet is not supported:

Ingesting GEOMETRY columns is only supported from TEXT, CSV, or SHAPEFILE.

So AFAIK, there is currently no way to upload a DataFrame with a GEOMETRY column using aws-wrangler.
