Inline sourcepos fixes. #439

kivikakk · 2024-07-12T08:27:19Z

Fixes #301.

Actually it looks like all the link stuff is just +2, probably because of the indent-but-not-really-indent.

github-actions · 2024-07-12T08:31:50Z

Command	Mean [ms]	Min [ms]	Max [ms]	Relative
`./bench.sh ./comrak-3640a8d`	328.4 ± 3.7	323.6	338.7	2.95 ± 0.05
`./bench.sh ./comrak-main`	327.2 ± 2.8	322.6	335.0	2.93 ± 0.04
`./bench.sh ./pulldown-cmark`	111.5 ± 1.2	109.7	114.6	1.00
`./bench.sh ./cmark-gfm`	121.8 ± 1.9	120.0	128.1	1.09 ± 0.02
`./bench.sh ./markdown-it`	497.0 ± 3.7	491.6	505.9	4.46 ± 0.06

Run on Fri Jul 12 08:31:50 UTC 2024

kivikakk · 2024-07-12T08:55:05Z

How awkward. In all these cases:

- A
  B

- A
B

- A
   B

the inline parser is initialised with the subject "A\nB" and a block_offset of 2, and doesn't have any distinguishing information. This is a consequence of cmark/Comrak's particular way of collecting block content before handing it off to the inline parser — it normalises things such that the inline parser doesn't need to worry about the minutae of what falls into a block and what doesn't, but we also lose information which we cannot recover.

This will require something of a rearchitecting — sourcepos info needs to be associated and carried through from much earlier in the block processing stage, with the possibility of multiple spans per content block[1] — and I have no real inclination to do such work at present.

In light of all this, for now I'm going to adjust to match upstream (#301 (comment)) and not output inline-level sourcepos in HTML in the same cases as cmark. Block-level sourcepos will be unaffected, as it's not known to have issues.

[1] In the examples above, we still need to be passing through the same inline Subject::input of "A\nB", but with sourcepos marker collections [(1:3-1:4),(2:3-2:3)], [(1:3-1:4),(2:1-2:1)] and [(1:3-1:4),(2:4-2:4)] respectively (where (1:4) aligns with the \n which is turned into a soft break). Right now we only pass a single point (1:3) and infer the rest incrementally.

This changes a lot of the bookkeeping on the block parser side, and a complete rework of sourcepos handling on the inline parser side, so I can't even say for sure that I won't hit an even bigger roadblock. Also worth noting in conjunction is this comment about autolink sourcepos unsoundness.

The information is still there in the AST, and e.g. XML output will still include it.

kivikakk added 4 commits July 12, 2024 11:13

tests: add baseline and failing.

db89e78

inlines: set not just SC but whole SP based on opener.

9a4262c

tests: this last case is abysmal.

63b967e

Actually it looks like all the link stuff is just +2, probably because of the indent-but-not-really-indent.

tests: note this *works* and must remain.

c271949

tests: another case which misses.

69ebedd

kivikakk added 2 commits July 12, 2024 11:56

tests: ignore some failures.

21d97fe

html: don't report sourcepos on inlines.

cc17182

The information is still there in the AST, and e.g. XML output will still include it.

kivikakk marked this pull request as ready for review July 12, 2024 09:14

parser: document sourcepos reliability.

983180b

kivikakk enabled auto-merge July 12, 2024 09:17

kivikakk mentioned this pull request Jul 12, 2024

Unexpected sourcepos. #301

Closed

kivikakk merged commit a38b4e3 into main Jul 12, 2024
39 checks passed

kivikakk deleted the inline-sourcepos branch July 12, 2024 09:24

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inline sourcepos fixes. #439

Inline sourcepos fixes. #439

kivikakk commented Jul 12, 2024

github-actions bot commented Jul 12, 2024

kivikakk commented Jul 12, 2024

Inline sourcepos fixes. #439

Inline sourcepos fixes. #439

Conversation

kivikakk commented Jul 12, 2024

github-actions bot commented Jul 12, 2024

kivikakk commented Jul 12, 2024