Reading CSV file returns incorrect line break content #9797

jeffvalk · 2024-05-23T17:54:45Z

When multiple consecutive newline characters appear inside a quoted CSV field, Pandoc coalesces them into a single SoftBreak in the resulting AST. According to RFC 4180, this would seem to be incorrect behavior: the RFC's grammar treats CR and LF like any other character inside a quoted field.

Shouldn't the CSV reader return one LineBreak for each line break in \r\n\r\n\r\n, rather than a single SoftBreak for the whole run?

At minimum, I would think there should be no information loss during the read, which means encoding the original number of line breaks in some way. Currently, it's not possible to reconstruct the input data accurately from the AST.

Tested with Pandoc 3.1.13
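
For comparison, here is a minimal sketch of how an RFC 4180-style parser treats the same kind of field. It uses Python's standard csv module, not Pandoc, and it assumes CRLF line endings in the input; the names raw, rows, and field are illustrative only.

```python
import csv
import io

# A single quoted field: one line break, then four in a row (CRLF endings assumed).
raw = '"one_line_break:\r\nfour_line_breaks:\r\n\r\n\r\n\r\nlast_line"\r\n'

# newline="" keeps Python from translating line endings before the parser sees them.
rows = list(csv.reader(io.StringIO(raw, newline="")))
field = rows[0][0]

# The line breaks inside the quotes are kept as ordinary field data.
print(repr(field))
print(field.count("\r\n"))
```

The parsed field still contains every line break from the input, so the original text can be reconstructed from it. That is the property the current AST output does not have.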

jeffvalk · 2024-05-23T18:21:19Z

Here is a minimal test case to reproduce:

Test input CSV:

"one_line_break:
four_line_breaks:



last_line"

Current JSON output from Pandoc (3.1.13):

{"pandoc-api-version":[1,23,1],"meta":{},"blocks":[{"t":"Table","c":[["",[],[]],[null,[]],[[{"t":"AlignDefault"},{"t":"ColWidthDefault"}]],[["",[],[]],[[["",[],[]],[[["",[],[]],{"t":"AlignDefault"},1,1,[{"t":"Plain","c":[{"t":"Str","c":"one_line_break:"},{"t":"SoftBreak"},{"t":"Str","c":"four_line_breaks:"},{"t":"SoftBreak"},{"t":"Str","c":"last_line"}]}]]]]]],[[["",[],[]],0,[],[]]],[["",[],[]],[]]]}]}

Expected JSON output (edited by hand to match the input):

{"pandoc-api-version":[1,23,1],"meta":{},"blocks":[{"t":"Table","c":[["",[],[]],[null,[]],[[{"t":"AlignDefault"},{"t":"ColWidthDefault"}]],[["",[],[]],[[["",[],[]],[[["",[],[]],{"t":"AlignDefault"},1,1,[{"t":"Plain","c":[{"t":"Str","c":"one_line_break:"},{"t":"LineBreak"},{"t":"Str","c":"four_line_breaks:"},{"t":"LineBreak"},{"t":"LineBreak"},{"t":"LineBreak"},{"t":"LineBreak"},{"t":"Str","c":"last_line"}]}]]]]]],[[["",[],[]],0,[],[]]],[["",[],[]],[]]]}]}

This is the same as the previous data, but replaces SoftBreak with LineBreak and inserts one element for each line break present in the input.
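
The mapping described above can be sketched as a small function that turns a cell's text into Pandoc-style JSON inlines, one LineBreak per line break. This is only an illustration of the expected output shape, not Pandoc's implementation; field_to_inlines is a hypothetical helper, and splitting words into Str and Space elements is an assumption about how plain cell text would be tokenized.

```python
import re


def field_to_inlines(field):
    """Convert cell text to a list of Pandoc JSON inlines (hypothetical helper).

    Each line break (CRLF, CR, or LF) becomes exactly one LineBreak element,
    so consecutive breaks are never merged.
    """
    inlines = []
    lines = re.split(r"\r\n|\r|\n", field)
    for i, line in enumerate(lines):
        if i > 0:
            # One LineBreak per break in the input, including empty lines.
            inlines.append({"t": "LineBreak"})
        for j, word in enumerate(line.split()):
            if j > 0:
                inlines.append({"t": "Space"})
            inlines.append({"t": "Str", "c": word})
    return inlines


cell = "one_line_break:\nfour_line_breaks:\n\n\n\nlast_line"
inlines = field_to_inlines(cell)
print(inlines)
```

For the test input, the result has the same sequence of Str and LineBreak elements as the Plain block in the expected JSON.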

If you save the above two JSON data structures as current.json and expected.json, and run:

pandoc current.json -o current.odt
pandoc expected.json -o expected.odt

you'll see that the table cell content in current.odt does not preserve line breaks accurately, whereas expected.odt is correctly formatted like the input CSV above.
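
The difference can also be checked without opening the ODT files, by counting break elements in the two ASTs. The sketch below walks a JSON structure and tallies SoftBreak and LineBreak nodes; count_breaks is a hypothetical helper, and the two fragments are the Plain cell contents copied from the JSON above.

```python
from collections import Counter


def count_breaks(node, counts=None):
    """Recursively tally SoftBreak and LineBreak nodes in a Pandoc JSON value."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        if node.get("t") in ("SoftBreak", "LineBreak"):
            counts[node["t"]] += 1
        for value in node.values():
            count_breaks(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_breaks(item, counts)
    return counts


# Cell contents from the current output.
current_cell = [
    {"t": "Str", "c": "one_line_break:"}, {"t": "SoftBreak"},
    {"t": "Str", "c": "four_line_breaks:"}, {"t": "SoftBreak"},
    {"t": "Str", "c": "last_line"},
]

# Cell contents from the expected output.
expected_cell = [
    {"t": "Str", "c": "one_line_break:"}, {"t": "LineBreak"},
    {"t": "Str", "c": "four_line_breaks:"},
    {"t": "LineBreak"}, {"t": "LineBreak"}, {"t": "LineBreak"}, {"t": "LineBreak"},
    {"t": "Str", "c": "last_line"},
]

current_counts = count_breaks(current_cell)
expected_counts = count_breaks(expected_cell)
print(dict(current_counts))
print(dict(expected_counts))
```

The same function works on a whole document loaded with json.load, for example from current.json and expected.json.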

jeffvalk added the bug label May 23, 2024

jgm closed this as completed in fa01764 May 27, 2024

jeffvalk mentioned this issue Jul 19, 2024

Plain text writer does not preserve line breaks #10007

Open
