Plain text writer does not preserve line breaks #10007

Open
jeffvalk opened this issue Jul 19, 2024 · 11 comments
@jeffvalk

jeffvalk commented Jul 19, 2024

The plain text writer coalesces sequential LineBreaks into a single new line.

The following example:

json='{"pandoc-api-version":[1,23,1],"meta":{},"blocks":[{"t":"Plain","c":[{"t":"Str","c":"two_line_breaks:"},{"t":"LineBreak"},{"t":"LineBreak"},{"t":"Str","c":"four_line_breaks:"},{"t":"LineBreak"},{"t":"LineBreak"},{"t":"LineBreak"},{"t":"LineBreak"},{"t":"Str","c":"last_line"}]}]}'
echo "$json" | pandoc -f json -t plain

produces this output:

two_line_breaks:
four_line_breaks:
last_line

where I would expect this output:

two_line_breaks:

four_line_breaks:



last_line

This is similar to the CSV reader issue (#9797) fixed in fa01764.

jeffvalk added the bug label Jul 19, 2024
@jeffvalk
Author

It may be worth noting that I'm using the plain text writer in a Lua filter as a whitespace-preserving version of the pandoc.utils.stringify function:

-- element contains raw markdown to parse
local s = pandoc.write(pandoc.Pandoc(element), "plain") -- write element AST to plain text
local b = pandoc.read(s, "markdown").blocks             -- read text as markdown

If there's a more idiomatic way to do this, please let me know!

@jgm
Owner

jgm commented Jul 20, 2024

This is because blank lines mean "new paragraph."

You can always include a RawInline or RawBlock element with whatever you want, and it will pass directly through to the output as long as the format of the RawInline/RawBlock matches the output format...

@jgm
Owner

jgm commented Jul 20, 2024

White space won't be preserved anyway; e.g. consecutive spaces would be collapsed.

@avidseeker

I had a similar request: #9663

It would be nice if Pandoc was less aggressive in reformatting markdown.

You should not expect the markdown writer to preserve aspects of your document that don't relate to its abstract structure.

@jeffvalk
Author

> This is because blank lines mean "new paragraph."

I'd fully expect this whenever a writer's format specifies it (e.g. markdown), but is this true for "plain" text? I presumed the plain text writer would output text without applying any format's semantics or production rules. I tend to think this should be the case; and if it intentionally isn't, the semantics attached to "plain" text should be documented.

> White space won't be preserved anyway; e.g. consecutive spaces would be collapsed.

This depends on the reader and the specification it implements. The JSON snippet above was produced by Pandoc's CSV reader. As of fa01764, the CSV reader preserves multiple consecutive LineBreaks, which is the correct behavior according to RFC 4180 (see #9797).

@jgm
Owner

jgm commented Jul 21, 2024

Our plain output is essentially a variant of markdown. Of course, there is no particular thing that "plain" text means.

> I presumed the plain text writer would output text without applying any format's semantics or production rules.

Block-level formatting has to happen somehow. How do we represent a paragraph in plain text? A heading? Lists?

Think of plain as a stripped-down markdown that avoids using symbols to mark syntax.

Multiple consecutive LineBreaks can be rendered in some formats, e.g. HTML, and even in pandoc's markdown, which has the BACKSLASH+NEWLINE way of writing a line break. But not in formats (like plain) where (a) a line break is just a newline and (b) a blank line is a paragraph break.

@jgm
Owner

jgm commented Jul 21, 2024

As I hinted, though, there's an easy workaround with a Lua filter. Just have the filter replace LineBreak elements with RawInline (Format "plain") "\n".
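
The same replacement can also be sketched outside Lua, as a transformation of the JSON AST from the original report. This is a minimal illustration in Python, not code from pandoc: the helper name `replace_line_breaks` is hypothetical, and it assumes the JSON encoding of RawInline is `{"t":"RawInline","c":[format, text]}`.

```python
import json

def replace_line_breaks(node):
    """Recursively swap every LineBreak for RawInline("plain", "\n")."""
    if isinstance(node, list):
        return [replace_line_breaks(item) for item in node]
    if isinstance(node, dict):
        if node.get("t") == "LineBreak":
            # Assumed JSON shape of a RawInline: c = [format, text]
            return {"t": "RawInline", "c": ["plain", "\n"]}
        return {key: replace_line_breaks(value) for key, value in node.items()}
    return node  # strings, numbers, etc. are left untouched

# A shortened version of the document from the report.
doc = json.loads(
    '{"pandoc-api-version":[1,23,1],"meta":{},"blocks":[{"t":"Plain","c":['
    '{"t":"Str","c":"two_line_breaks:"},{"t":"LineBreak"},{"t":"LineBreak"},'
    '{"t":"Str","c":"last_line"}]}]}'
)
converted = replace_line_breaks(doc)
print(json.dumps(converted))
```

Piping the printed JSON into `pandoc -f json -t plain` should then pass each newline through as raw output, assuming the plain writer accepts raw elements whose format is "plain", as described above.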

@jeffvalk
Author

> Our plain output is essentially a variant of markdown.

Thanks, @jgm. I think this would be good to document.

> Think of plain as a stripped-down markdown that avoids using symbols to mark syntax.

This description of the format in the manual would clarify nicely.

@jeffvalk
Author

> As I hinted, though, there's an easy workaround with a Lua filter. Just have the filter replace LineBreak elements with RawInline (Format "plain") "\n".

This should certainly work, and I appreciate the workaround. I do think a function for getting the text of the AST while preserving whitespace would be generally useful in Lua filters. I opened a separate issue (#10015) to propose that.
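
To make the idea concrete, here is a rough sketch of what such a function might compute, written in Python against the JSON AST from the report rather than against the Lua API. The name `inline_text` is hypothetical and not part of pandoc. The sketch handles inlines only, ignores quotation marks and raw elements, and can only preserve whitespace that the reader actually recorded in the AST.

```python
import json

def inline_text(node):
    """Concatenate the text of pandoc JSON inlines, keeping each LineBreak as a newline."""
    if isinstance(node, list):
        return "".join(inline_text(item) for item in node)
    if not isinstance(node, dict):
        return ""  # bare strings and numbers here are attributes, URLs, etc.
    tag = node.get("t")
    if tag == "Str":
        return node["c"]
    if tag in ("Space", "SoftBreak"):
        return " "
    if tag == "LineBreak":
        return "\n"
    if tag in ("Code", "Math"):
        return node["c"][1]  # c is [attr or math type, text]
    # Containers such as Emph, Strong, Link, Span: recurse into their contents.
    return inline_text(node.get("c", []))

# The inlines from the report's example document.
inlines = json.loads(
    '[{"t":"Str","c":"two_line_breaks:"},{"t":"LineBreak"},{"t":"LineBreak"},'
    '{"t":"Str","c":"four_line_breaks:"},{"t":"LineBreak"},{"t":"LineBreak"},'
    '{"t":"LineBreak"},{"t":"LineBreak"},{"t":"Str","c":"last_line"}]'
)
print(inline_text(inlines))
```

For the example input this yields one blank line after `two_line_breaks:` and three after `four_line_breaks:`, which matches the output expected in the original report.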

@jgm
Owner

jgm commented Jul 28, 2024

Preserving whitespace is not possible in a filter, because the whitespace information is often lost in the parsers.

@jeffvalk
Author

It's possible for a reader to encode whitespace in the AST; and depending on the language/format specification the reader implements, this may be required for correctness. #9797 was an instance of this in a built-in reader. And custom readers could certainly introduce others.
