Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse markdown found within HTML blocks #128

Closed
wants to merge 1 commit into from
Closed

Parse markdown found within HTML blocks #128

wants to merge 1 commit into from

Conversation

ron-at-swgy
Copy link

@ron-at-swgy ron-at-swgy commented Nov 22, 2023

This changeset updates the parsing behavior to pass along the content between opening and closing tags of a LOWDOWN_BLOCKHTML segment for additional markdown parsing. The strict tag matching behavior has been removed, as it lacked the context needed to operate correctly when there may be nested tags (such as nested divs).

Consider a content segment such as:

<div class="container">
<div class="container column-1">

# Header for column 1

And some text
</div>
<div class="container column-2">

# Header for column 2

With more text
</div>
</div>

With these changes, an outer LOWDOWN_BLOCKHTML node will be the parent of all the content within the "container" div. The closing tag for each block is included as the final child as a LOWDOWN_RAW_HTML node.

This changeset updates the parsing behavior to pass along the content
between opening and closing tags of a LOWDOWN_BLOCKHTML segment for
additional markdown parsing. The strict tag matching behavior has been
removed, as it lacked the context needed to operate correctly when there
may be nested tags (such as nested divs).

Consider a content segment such as:

```
<div class="container">
<div class="container column-1">

And some text
</div>
<div class="container column-2">

With more text
</div>
</div>
```
With these changes, an outer LOWDOWN_BLOCKHTML node will be the parent
of all the content within the "container" div. The closing tag for each
block is included as the final child as a LOWDOWN_RAW_HTML node.
@@ -3358,6 +3358,65 @@ parse_atxheader(struct lowdown_doc *doc, char *data, size_t size)
return skip;
}

/*
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition for html_find_block was moved up to allow visibility in html_find_end.

* Returns the length on match, 0 otherwise.
*/
static size_t
html_find_end_strict(const char *tag, size_t tag_len,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed html_find_end_strict as the way it was coded did not allow enough context to know about nested HTML blocks that may have the same tag type (such as nested divs).


result = hbuf_putc(ob, '\n');

if (!hbuf_putb(ob, content))
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Print out the inner content. Without this change, any child nodes of LOWDOWN_BLOCKHTML are passed over.

Copy link
Owner

@kristapsdz kristapsdz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this! I'll look at it over the next week or so. As this parsing is standard behaviour for pandoc, it should probably be enabled by default. Moving forward:

  • First, make sure that this flies with all regression tests (make regress).
    • This will probably require adding an option to disable the behaviour, as it conflicts with the original Markdown format, which some tests depend upon. I can help with this.
    • If regression tests for standard invocation fail by depending on interior bits not being parsed, these tests should be fixed or removed.
  • Second, add regression tests specifically for this behaviour.
  • Third, run the behaviour exhaustively through AFL to find any corner bugs. I can do this with access to bigger machines.

@ron-at-swgy
Copy link
Author

Thank you for your feedback. The Inline_HTML_Advanced regression test case breaks with my changes. I'll work towards solving that and any other regression failures.

@ron-at-swgy
Copy link
Author

I have not given up on this PR. I apologize for the lengthy delay, but still hope to get the behavior working without breaking existing functionality.

@ron-at-swgy ron-at-swgy closed this by deleting the head repository Jun 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants