Add support for embedded line breaks in CSV #8350
Comments
I stumbled upon this issue today. I had to transform the data from CSV (with newlines in values/columns) to Parquet in order for Trino to read it…
Trino uses the OpenCSVSerde from Hive to read CSV tables, and that serde has a number of limitations, documented at https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html. This would need to be fixed in the serde.
We're using another CSV implementation because OpenCSV is extremely slow. But the CSV implementation is not the problem here. I suppose the problem is the RecordReader, which for TextInputFormat is purely line-based. That means the RecordReader first searches for a line delimiter and only then parses the record using a CSVParser. If my (quick and shallow) code analysis is correct, then for CSV values containing newlines to be parsable, Trino needs a completely new, CSV-aware RecordReader/TextInputFormat. Overall it shows that CSV is anything but a simple format.
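To illustrate the point about line-based record reading, here is a minimal Python sketch. It is not Trino or Hive code: the sample data and the variable names are made up, and Python's standard `csv` module stands in for the CSVParser. It compares splitting on newlines before parsing with letting the parser see the whole stream.

```python
import csv
import io

# Hypothetical sample: the second field of the first data row contains a line break.
data = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Line-based reading: cut the input at every newline first, then hand each
# line to the CSV parser on its own. The quoted field is torn in two.
line_based = [next(csv.reader([line])) for line in data.splitlines()]

# CSV-aware reading: the parser consumes the stream itself and tracks quote
# state, so the newline inside the quotes stays part of the field.
csv_aware = list(csv.reader(io.StringIO(data, newline="")))

print(len(line_based), "records when splitting on newlines first")
print(len(csv_aware), "records when the parser handles line breaks")
print(csv_aware[1])
```

The line-based variant reports one record more than the file actually contains, because the newline inside the quoted field is treated as a record boundary before the parser ever sees the quotes.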
Interesting, thanks for digging into the code. But then you lose the splittable nature of the current CSV reading mechanism, and you'll be limited to a single reader per CSV file instead of having multiple splits read in parallel. Tradeoffs on both sides, it seems.
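The splittability tradeoff can be sketched in Python as well. This is a simplified model of how a line-based reader might align itself when its split starts in the middle of a file, not the actual Hadoop or Trino logic; `next_record_start`, the sample bytes, and the split offset are all hypothetical.

```python
def next_record_start(buf: bytes, offset: int) -> int:
    """Simplified split alignment: skip to the byte after the next newline."""
    newline_pos = buf.find(b"\n", offset)
    return len(buf) if newline_pos == -1 else newline_pos + 1

# Hypothetical file content; the quoted field spans two physical lines.
data = b'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Assume a split boundary that happens to fall inside the quoted field.
split_offset = 15

start = next_record_start(data, split_offset)

# The reader now believes a record starts here, but this is the tail of a
# quoted field. Without scanning from the beginning of the file it cannot
# know whether it is inside or outside quotes.
print(data[start:])
```

A reader that begins at an arbitrary offset has no local way to tell a record-terminating newline from one embedded in a quoted value, which is why CSV-aware reading tends to imply one reader per file.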
Thanks for the RFC pointer. Does Apache Hive support CSV files with embedded line breaks?
@findepi it's a bad idea to re-implement wrong behaviour just to be compatible with legacy systems. That's what Microsoft did wrong for years. You cannot succeed Hive if you're making the same mistakes. Just my 2¢. ;) Regarding newlines in CSV values in Hive:
That's what the Hive connector is. I agree this isn't an awesome path, so I do recommend you try out the Iceberg and Delta connectors as well.
I ran into this issue as well. Do you plan to fix it?
Using the Hive connector, I am trying to read a CSV which contains cells that have embedded new lines.
The RFC has it covered (page 2):
Here is an example CSV
which I try to query from a table
that returns