Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for embedded line breaks in CSV #8350

Open
bo5o opened this issue Jun 22, 2021 · 8 comments
Open

Add support for embedded line breaks in CSV #8350

bo5o opened this issue Jun 22, 2021 · 8 comments

Comments

@bo5o
Copy link

bo5o commented Jun 22, 2021

Using the Hive connector, I am trying to read a CSV which contains cells that have embedded new lines.

The RFC has it covered (page 2):

  1. Fields containing line breaks (CRLF), double quotes, and commas
    should be enclosed in double-quotes. For example:

    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx

Here is an example CSV

"aaa","b
bb","ccc"
zzz,yyy,xxx

which I try to query from a table

CREATE TABLE example (
    "col1" VARCHAR, 
    "col2" VARCHAR, 
    "col3" VARCHAR
) with (
    format = 'CSV',
    external_location = '/path/to/csv',
    csv_separator = ',',
    csv_quote='"'
)
select * from example

that returns

col1 | col2 | col3
aaa  |      |      
zzz  | yyy  | xxx
@realknorke
Copy link
Member

I stumbled upon this issue today. I had to transform the data from CSV (with newlines in values/columns) to parquet in order for Trino to read it…

@hashhar
Copy link
Member

hashhar commented Dec 21, 2022

Trino uses the OpenCSVSerde from Hive to read CSV tables and that serde has a number of limitations - documented https://docs.aws.amazon.com/athena/latest/ug/csv-serde.html

This would need to be fixed in the serde.

@realknorke
Copy link
Member

We're using another CSV Implementation because OpenCSV is extremely slow. But the CSV implementation is not the problem here. I suppose the problem is the RecordReader which is for TextInputFormat just line-based. That means that the RecordReader is searching for a delimiter and then (after that) is parsing a record using a CSVParser.

If my (quick and shallow) code analysis is correct then for CSV values with newlines in it to be parsable Trino needs a completely new RecordReader/TextInputFormat which is CSV-aware.

Overall it shows that CSV is all but a simple format.

@hashhar
Copy link
Member

hashhar commented Dec 21, 2022

Interesting, thanks for digging into the code. But then you loose the splittable nature of current CSV reading mecahanism and you'll be limited to single reader per CSV file instead of having multiple splits read in parallel.

Tradeoffs on both sides it seems.

@findepi
Copy link
Member

findepi commented Dec 21, 2022

The RFC has it covered (page 2):

  1. Fields containing line breaks (CRLF), double quotes, and commas
    should be enclosed in double-quotes. For example:
    "aaa","b CRLF
    bb","ccc" CRLF
    zzz,yyy,xxx

Thanks for the RFC pointer.
Note that for Hive connector behavior, the Hive itself is the reference implementation.

Does Apache Hive support CSV files with embedded line breaks?
If not, we shouldn't add such a change to Trino.

@realknorke
Copy link
Member

realknorke commented Dec 22, 2022

@findepi its a bad idea to re-implement wrong behaviour just to be compatible with legacy systems. That's what Microsoft did wrong for years. You cannot succeed to Hive if you're doing the same mistakes. Just my 2¢. ;)

Regarding newlines in CSV values in Hive:
It seems Hive cannot handle that. BUT You can define (write) your own INPUTFORMAT and add this class as a table property. By doing so it is possible to generate correct (like RFC4180) results from wellformed CSV input.
Is it possible to do something like this in Trino? Is it necessary to create a completely new FORMAT (table property)?

@findepi
Copy link
Member

findepi commented Dec 23, 2022

its a bad idea to re-implement wrong behaviour just to be compatible with legacy systems.

that's what Hive connector is.

I agree this isn't awesome path, so I do recommend you try out Iceberg and Delta connectors as well

@bass2008
Copy link

I encountered with this issue. Do you plan to fix it ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

No branches or pull requests

5 participants