Allow more than parquet files in redshift copy from files function #2839

Closed
m1hawkgsm opened this issue May 28, 2024 · 3 comments · Fixed by #2849

@m1hawkgsm

Is your feature request related to a problem? Please describe.
The awswrangler.redshift.copy_from_files function is quite powerful, as it abstracts away the need to create temporary tables for upserts, as well as other details that are tedious when using the classic redshift_connector library directly. However, it only supports Parquet files.

Would it be possible to allow other file formats such as CSV? For my particular use case, I am exporting bulk data from an Aurora Postgres instance to S3 and loading it into Redshift. Ideally Postgres could export to Parquet, but the aws_s3.query_export_to_s3() function only supports text, binary, or CSV output, not Parquet.
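
For context, this is roughly how the data lands in S3 as CSV on the export side. A minimal sketch, assuming a psycopg2 connection to the Aurora instance; the host, bucket, and table names are placeholders:

import psycopg2

# Placeholder connection details for the Aurora PostgreSQL cluster.
conn = psycopg2.connect(
    host="my-aurora-cluster.cluster-xyz.us-east-1.rds.amazonaws.com",
    dbname="mydb",
    user="myuser",
    password="mypassword",
)

with conn, conn.cursor() as cur:
    # aws_s3.query_export_to_s3 writes the query result straight to S3;
    # it only supports text, CSV, or binary output, hence the CSV files.
    cur.execute("""
        SELECT * FROM aws_s3.query_export_to_s3(
            'SELECT * FROM my_schema.my_table',
            aws_commons.create_s3_uri('my-bucket', 'exports/my_table.csv', 'us-east-1'),
            options := 'format csv'
        );
    """)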

Obviously I could leverage another tool such as Glue / Spark, but that defeats the utility of this particular redshift submodule method.

Describe the solution you'd like
Can the redshift.copy_from_files method be adjusted to allow passing a format argument?
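
Something along these lines, purely to illustrate the requested interface. The format argument below is hypothetical (it does not exist today), and the connection name, path, and IAM role ARN are placeholders:

import awswrangler as wr

con = wr.redshift.connect("my-glue-connection")

wr.redshift.copy_from_files(
    path="s3://my-bucket/exports/",
    con=con,
    table="my_table",
    schema="public",
    iam_role="arn:aws:iam::123456789012:role/my-redshift-copy-role",
    format="csv",  # hypothetical new argument
)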

Describe alternatives you've considered
Not using this method and implementing the logic myself (which defeats the purpose of using awswrangler here).

@LeonLuttenberger
Contributor

Hey,

One of the issues with formats like CSV is that, unlike Parquet or ORC, they don't store metadata on things such as column types. In order to infer types and transform them to the corresponding Redshift types, we need to load the whole table.

As such, if redshift.copy_from_files supported CSV files, it would be equivalent to just loading the CSV data using s3.read_csv and then invoking redshift.copy with the DataFrame. This also presents a simple workaround for your issue:

import awswrangler as wr

df = wr.s3.read_csv("s3://...", ...)
wr.redshift.copy(df=df, path=temp_path, table=table_name)

Let me know if this helps,
Leon

@m1hawkgsm
Author

Yeah, I was thinking about that, and it makes sense. On the other hand, the Postgres unload (or other operations that yield CSV data, for that matter) often produces files that are very large (> 20 GB), which makes loading them locally infeasible in some cases and inefficient in others (after all, the whole point is large, parallel bulk operations, right?).

Would it make sense to allow CSVs as long as you pass in the schema manually? The benefit here is being able to reuse how the package does merges/upserts behind the scenes (which I otherwise end up implementing on my own).
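
For example, something like the following. The format and column_types arguments are hypothetical, just to illustrate passing the schema explicitly so that no local type inference (and no full download of the CSV files) is needed; connection name, path, and role ARN are placeholders:

import awswrangler as wr

con = wr.redshift.connect("my-glue-connection")

# With the column types supplied up front, the existing temp-table plus
# merge/upsert machinery could run without reading the CSV data locally.
wr.redshift.copy_from_files(
    path="s3://my-bucket/exports/",
    con=con,
    table="my_table",
    schema="public",
    iam_role="arn:aws:iam::123456789012:role/my-redshift-copy-role",
    mode="upsert",
    primary_keys=["id"],
    format="csv",                                           # hypothetical
    column_types={"id": "BIGINT", "name": "VARCHAR(256)"},  # hypothetical
)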

@pvieito

pvieito commented Jun 5, 2024

@LeonLuttenberger the option to support CSV would also be very useful for uploading DataFrames with GEOMETRY columns, as ingesting them via Parquet is not supported:

Ingesting GEOMETRY columns is only supported from TEXT, CSV, or SHAPEFILE.

So AFAIK, there is currently no way to upload a DataFrame with a GEOMETRY column using aws-wrangler.
