Parse markdown found within HTML blocks #128

ron-at-swgy · 2023-11-22T17:40:15Z

This changeset updates the parsing behavior to pass along the content between opening and closing tags of a LOWDOWN_BLOCKHTML segment for additional markdown parsing. The strict tag matching behavior has been removed, as it lacked the context needed to operate correctly when there may be nested tags (such as nested divs).

Consider a content segment such as:

```
<div class="container">
<div class="container column-1">

# Header for column 1

And some text
</div>
<div class="container column-2">

# Header for column 2

With more text
</div>
</div>
```

With these changes, an outer LOWDOWN_BLOCKHTML node will be the parent of all the content within the "container" div. The closing tag for each block is included as that block's final child, a LOWDOWN_RAW_HTML node.

ron-at-swgy · 2023-11-22T17:41:08Z

document.c

@@ -3358,6 +3358,65 @@ parse_atxheader(struct lowdown_doc *doc, char *data, size_t size)
 return skip;
 }

+/*


The definition of html_find_block was moved up so that it is visible to html_find_end.

ron-at-swgy · 2023-11-22T17:41:55Z

document.c

- * Returns the length on match, 0 otherwise.
- */
-static size_t
-html_find_end_strict(const char *tag, size_t tag_len,


Removed html_find_end_strict: as written, it did not have enough context to handle nested HTML blocks that share the same tag type (such as nested divs).
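
To illustrate the nesting problem, here is a minimal sketch of depth-aware close-tag matching. It is not the code in this changeset and not lowdown's API: `tag_at` and `find_matching_close` are hypothetical helpers written only for this example, and they assume a tag name is followed by `>` or whitespace (self-closing tags, comments, and attribute values containing `<` are ignored).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strncasecmp (POSIX) */

/*
 * Hypothetical helper: return 1 if an opening (closing == 0) or
 * closing (closing == 1) tag named "tag" starts at data[pos].
 */
static int
tag_at(const char *data, size_t size, size_t pos,
    const char *tag, size_t tag_len, int closing)
{
	/* '<', optional '/', the name, then one delimiter byte. */
	size_t need = 1 + (closing ? 1 : 0) + tag_len + 1;

	if (pos + need > size || data[pos] != '<')
		return 0;
	pos++;
	if (closing) {
		if (data[pos] != '/')
			return 0;
		pos++;
	}
	if (strncasecmp(data + pos, tag, tag_len) != 0)
		return 0;
	pos += tag_len;
	/* Reject longer names such as <divider> when tag is "div". */
	return data[pos] == '>' || isspace((unsigned char)data[pos]);
}

/*
 * Hypothetical helper: data starts at an opening tag.  Return the
 * offset just past the close tag that balances it, or 0 if the
 * tags never balance.  A depth counter supplies the context that a
 * "first closing tag wins" search lacks.
 */
static size_t
find_matching_close(const char *data, size_t size, const char *tag)
{
	size_t tag_len, depth = 0, i;

	assert(data != NULL && tag != NULL);
	tag_len = strlen(tag);

	for (i = 0; i < size; i++) {
		if (data[i] != '<')
			continue;
		if (tag_at(data, size, i, tag, tag_len, 0)) {
			depth++;
		} else if (tag_at(data, size, i, tag, tag_len, 1)) {
			if (depth == 0)
				return 0;	/* stray closing tag */
			if (--depth == 0) {
				/* Skip to the end of this closing tag. */
				while (i < size && data[i] != '>')
					i++;
				return i < size ? i + 1 : 0;
			}
		}
	}
	return 0;	/* unbalanced input */
}
```

With the nested "container" example from the description, a search that stops at the first `</div>` would end the outer block after column 1; counting depth ends it at the final `</div>` instead.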

ron-at-swgy · 2023-11-22T17:43:02Z

html.c

+
+ result = hbuf_putc(ob, '\n');
+
+ if (!hbuf_putb(ob, content))


Print out the inner content. Without this change, any child nodes of LOWDOWN_BLOCKHTML are passed over.

kristapsdz · 2023-12-03

Thank you for this! I'll look at it over the next week or so. As this parsing is standard behaviour for pandoc, it should probably be enabled by default. Moving forward:

1. First, make sure that this passes all regression tests (`make regress`).
   - This will probably require adding an option to disable the behaviour, as it conflicts with the original Markdown format, which some tests depend upon. I can help with this.
   - If regression tests for standard invocation fail because they depend on interior bits not being parsed, these tests should be fixed or removed.
2. Second, add regression tests specifically for this behaviour.
3. Third, run the behaviour exhaustively through AFL to find any corner-case bugs. I can do this with access to bigger machines.

ron-at-swgy · 2023-12-05T13:34:19Z

Thank you for your feedback. The Inline_HTML_Advanced regression test case breaks with my changes. I'll work towards solving that and any other regression failures.

ron-at-swgy · 2024-03-26T17:17:45Z

I have not given up on this PR. I apologize for the lengthy delay, but still hope to get the behavior working without breaking existing functionality.

ron-at-swgy mentioned this pull request Nov 22, 2023

Using <div> with inline HTML causes incorrect parsing #127

Closed

kristapsdz reviewed Dec 3, 2023

ron-at-swgy closed this by deleting the head repository Jun 18, 2024
