Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Inline sourcepos fixes. #439

Merged
merged 8 commits into from
Jul 12, 2024
Merged

Inline sourcepos fixes. #439

merged 8 commits into from
Jul 12, 2024

Conversation

kivikakk
Copy link
Owner

Fixes #301.

Copy link
Contributor

Command Mean [ms] Min [ms] Max [ms] Relative
./bench.sh ./comrak-3640a8d 328.4 ± 3.7 323.6 338.7 2.95 ± 0.05
./bench.sh ./comrak-main 327.2 ± 2.8 322.6 335.0 2.93 ± 0.04
./bench.sh ./pulldown-cmark 111.5 ± 1.2 109.7 114.6 1.00
./bench.sh ./cmark-gfm 121.8 ± 1.9 120.0 128.1 1.09 ± 0.02
./bench.sh ./markdown-it 497.0 ± 3.7 491.6 505.9 4.46 ± 0.06

Run on Fri Jul 12 08:31:50 UTC 2024

@kivikakk
Copy link
Owner Author

How awkward. In all these cases:

- A
  B
- A
B
- A
   B

the inline parser is initialised with the subject "A\nB" and a block_offset of 2, and doesn't have any distinguishing information. This is a consequence of cmark/Comrak's particular way of collecting block content before handing it off to the inline parser — it normalises things such that the inline parser doesn't need to worry about the minutae of what falls into a block and what doesn't, but we also lose information which we cannot recover.

This will require something of a rearchitecting — sourcepos info needs to be associated and carried through from much earlier in the block processing stage, with the possibility of multiple spans per content block[1] — and I have no real inclination to do such work at present.

In light of all this, for now I'm going to adjust to match upstream (#301 (comment)) and not output inline-level sourcepos in HTML in the same cases as cmark. Block-level sourcepos will be unaffected, as it's not known to have issues.


[1] In the examples above, we still need to be passing through the same inline Subject::input of "A\nB", but with sourcepos marker collections [(1:3-1:4),(2:3-2:3)], [(1:3-1:4),(2:1-2:1)] and [(1:3-1:4),(2:4-2:4)] respectively (where (1:4) aligns with the \n which is turned into a soft break). Right now we only pass a single point (1:3) and infer the rest incrementally.

This changes a lot of the bookkeeping on the block parser side, and a complete rework of sourcepos handling on the inline parser side, so I can't even say for sure that I won't hit an even bigger roadblock. Also worth noting in conjunction is this comment about autolink sourcepos unsoundness.

The information is still there in the AST, and e.g. XML output will still include it.
@kivikakk kivikakk marked this pull request as ready for review July 12, 2024 09:14
@kivikakk kivikakk enabled auto-merge July 12, 2024 09:17
@kivikakk kivikakk mentioned this pull request Jul 12, 2024
@kivikakk kivikakk merged commit a38b4e3 into main Jul 12, 2024
39 checks passed
@kivikakk kivikakk deleted the inline-sourcepos branch July 12, 2024 09:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unexpected sourcepos.
1 participant