Reading CSV file returns incorrect line break content #9797

Closed
jeffvalk opened this issue May 23, 2024 · 1 comment

When several consecutive newline characters appear inside a quoted CSV field, Pandoc coalesces them into a single SoftBreak in the resulting AST. According to RFC 4180, this seems to be incorrect behavior: the RFC's grammar treats CR and LF like any other character inside a quoted field.

Shouldn't the CSV reader return one LineBreak per line break for \r\n\r\n\r\n, rather than a single SoftBreak?

At minimum, I would think there should be no information loss during the read, which means encoding the original number of line breaks in some way. Currently, it's not possible to reconstruct the input data accurately from the AST.
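
For comparison, here is a minimal sketch of how a parser that follows the RFC 4180 grammar handles such a field. It uses Python's standard csv module, not Pandoc, and it assumes the line breaks are bare LF rather than CRLF to keep the string short. The sample text mirrors the test case in this issue.

```python
import csv
import io

# One record with a single quoted field: one line break after the first
# label, then four line breaks before the last line.
raw = '"one_line_break:\nfour_line_breaks:\n\n\n\nlast_line"\n'

# newline="" stops the text layer from translating line endings, so the
# csv reader sees the quoted field exactly as written.
rows = list(csv.reader(io.StringIO(raw, newline="")))

field = rows[0][0]
break_count = field.count("\n")

print(repr(field))
print(break_count)
```

Every line break inside the quotes survives as part of the field value, which is the property the AST would need in order to reconstruct the input.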

Tested with Pandoc 3.1.13

jeffvalk added the bug label May 23, 2024

jeffvalk commented May 23, 2024

Here is a minimal test case to reproduce:

Test input CSV:

"one_line_break:
four_line_breaks:



last_line"

Current JSON output from Pandoc (3.1.13):

{"pandoc-api-version":[1,23,1],"meta":{},"blocks":[{"t":"Table","c":[["",[],[]],[null,[]],[[{"t":"AlignDefault"},{"t":"ColWidthDefault"}]],[["",[],[]],[[["",[],[]],[[["",[],[]],{"t":"AlignDefault"},1,1,[{"t":"Plain","c":[{"t":"Str","c":"one_line_break:"},{"t":"SoftBreak"},{"t":"Str","c":"four_line_breaks:"},{"t":"SoftBreak"},{"t":"Str","c":"last_line"}]}]]]]]],[[["",[],[]],0,[],[]]],[["",[],[]],[]]]}]}

Expected JSON output (hand-edited to match the input):

{"pandoc-api-version":[1,23,1],"meta":{},"blocks":[{"t":"Table","c":[["",[],[]],[null,[]],[[{"t":"AlignDefault"},{"t":"ColWidthDefault"}]],[["",[],[]],[[["",[],[]],[[["",[],[]],{"t":"AlignDefault"},1,1,[{"t":"Plain","c":[{"t":"Str","c":"one_line_break:"},{"t":"LineBreak"},{"t":"Str","c":"four_line_breaks:"},{"t":"LineBreak"},{"t":"LineBreak"},{"t":"LineBreak"},{"t":"LineBreak"},{"t":"Str","c":"last_line"}]}]]]]]],[[["",[],[]],0,[],[]]],[["",[],[]],[]]]}]}

This is the same as the previous data, but replaces SoftBreak with LineBreak and inserts one element for each line break present in the input.
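
The mapping described above can be sketched as a small function. This is only an illustration of the proposed behavior, not Pandoc's reader (which is written in Haskell). The name field_to_inlines is hypothetical, and splitting words on whitespace into Str and Space elements is an assumption about how cell text would be tokenized.

```python
import re


def field_to_inlines(text):
    """Turn a CSV field into a list of Pandoc-style inline elements.

    Hypothetical helper: emits one LineBreak per line break in the input
    (CRLF, CR, or LF), so the number of breaks is preserved.
    """
    inlines = []
    lines = re.split(r"\r\n|\r|\n", text)
    for line_index, line in enumerate(lines):
        if line_index > 0:
            # One element per line break, including consecutive ones.
            inlines.append({"t": "LineBreak"})
        for word_index, word in enumerate(line.split()):
            if word_index > 0:
                inlines.append({"t": "Space"})
            inlines.append({"t": "Str", "c": word})
    return inlines


cell = field_to_inlines("one_line_break:\r\nfour_line_breaks:\r\n\r\n\r\n\r\nlast_line")
line_breaks = sum(1 for inline in cell if inline["t"] == "LineBreak")
print(line_breaks)
```

For the test input, this yields the same inline sequence as the cell in the expected JSON: a Str, one LineBreak, a Str, four LineBreaks, and a final Str.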

If you save the above two JSON data structures as current.json and expected.json, and run:

pandoc current.json -o current.odt
pandoc expected.json -o expected.odt

you'll see that the table cell content in current.odt does not preserve line breaks accurately, whereas expected.odt is correctly formatted like the input CSV above.
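
The difference can also be checked without opening the documents. The following is a rough sketch that counts element types in Pandoc JSON. The function name count_node_types is hypothetical, and the two inline lists are copied from the table cell in the JSON outputs above.

```python
import json
from collections import Counter


def count_node_types(node, counts=None):
    """Recursively count the "t" tags found in a Pandoc JSON structure."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        tag = node.get("t")
        if isinstance(tag, str):
            counts[tag] += 1
        for value in node.values():
            count_node_types(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_node_types(item, counts)
    return counts


# Cell contents from the current and expected outputs in this issue.
current_cell = json.loads(
    '[{"t":"Str","c":"one_line_break:"},{"t":"SoftBreak"},'
    '{"t":"Str","c":"four_line_breaks:"},{"t":"SoftBreak"},'
    '{"t":"Str","c":"last_line"}]'
)
expected_cell = json.loads(
    '[{"t":"Str","c":"one_line_break:"},{"t":"LineBreak"},'
    '{"t":"Str","c":"four_line_breaks:"},{"t":"LineBreak"},'
    '{"t":"LineBreak"},{"t":"LineBreak"},{"t":"LineBreak"},'
    '{"t":"Str","c":"last_line"}]'
)

current_counts = count_node_types(current_cell)
expected_counts = count_node_types(expected_cell)
print(current_counts["SoftBreak"], expected_counts["LineBreak"])
```

The same function can be applied to the full documents by loading current.json and expected.json with json.load and passing the result in.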
