[spec/interpreter/test] Align definition of newline with Unicode reco… #1684

rossberg · 2023-09-25T13:32:49Z

So far, the Wasm text format only recognises LF as a newline. This PR tweaks that definition to align with Section 5.8 of the Unicode standard, which recommends treating CR, LF, CRLF, and NEL as equivalent newlines in text files, in order to be agnostic to newline translations performed by various tools and conversions between OSes and character sets.

This only affects the interpretation of line comments, where previously, code like

;; comment<CR>
(instr)

would (surprisingly) treat the instruction as part of the comment. With this change, it no longer does. While theoretically a breaking change, it shouldn't affect any well-formatted code, so it seems okay to consider this a bug fix.
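The difference can be illustrated with a minimal sketch in Python. It is not the reference interpreter's lexer; the function name `line_comment_end` and its `newlines` parameter are hypothetical, and NEL handling is left out.

```python
def line_comment_end(src: str, start: int, newlines=("\r\n", "\n", "\r")) -> int:
    """Return the index just past the `;;` line comment that begins at `start`.

    The comment runs up to and including the first newline sequence,
    or to the end of the input if no newline follows.
    """
    assert src.startswith(";;", start)
    i = start + 2
    while i < len(src):
        # "\r\n" is listed first so that CRLF is consumed as one newline.
        for nl in newlines:
            if src.startswith(nl, i):
                return i + len(nl)
        i += 1
    return len(src)


src = ";; comment\r(instr)"

# Old behaviour (assumed: only LF ends a comment): the instruction is swallowed.
old_rest = src[line_comment_end(src, 0, newlines=("\n",)):]

# New behaviour: a bare CR ends the comment, so the instruction survives.
new_rest = src[line_comment_end(src, 0):]

print(repr(old_rest), repr(new_rest))
```

Under the LF-only rule nothing is left after the comment, while with CR recognised the remaining input is `(instr)`.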

…mmendation

sunfishcode · 2023-10-11T09:15:48Z

I propose not recognizing NEL (U+0085) as a newline character. It is not recognized as a newline in popular text formats, tools, or editors, so files using NEL newlines would be inconvenient to work with if they were in circulation. An incomplete list of popular text formats that don't recognize NEL:

rossberg · 2023-10-11T14:08:41Z

Yeah, I'm not feeling strongly. I'd assume there is value in following the official Unicode recommendation, even though NEL hardly matters in practice. I believe the motivation for that recommendation is maximising compatibility with legacy character sets, but I'd be fine limiting that concern to ASCII.

That said, there are various other codepoints in Unicode that represent forms of line breaks, and there likely is no consistency in how editors and tools treat those. So it's doubtful if that form of alignment is even possible, independent of NEL.

sunfishcode · 2023-10-11T16:22:03Z

Yes, the situations with plain-CR, LS, and PS are complex, and there's no way to avoid some problems no matter what we do here. But, we can still avoid other problems, which I propose we do here.

tlively · 2023-10-16T19:38:43Z

Do NEL characters ever appear in practice? If not, following Unicode would not cause any additional problems in practice and we would get the benefits of just doing what Unicode recommends (at least for this particular sliver of the design space).

sunfishcode · 2023-10-24T00:55:00Z

> Do NEL characters ever appear in practice? If not, following Unicode would not cause any additional problems in practice and we would get the benefits of just doing what Unicode recommends (at least for this particular sliver of the design space).

I'm not aware of any practical benefits of following this particular Unicode recommendation here. NEL's only known use in practice is in text converted from EBCDIC. My understanding is that Unicode added the recommendation about NEL for the benefit of interoperability with EBCDIC documents. I don't believe this is relevant to Wasm.

Additional examples of things that don't interpret U+0085 (NEL) as a newline include JS, CSS, HTML, Python, and many more.

tlively · 2023-10-25T02:21:38Z

I do think that being able to offload all decision-making in this area to Unicode is an advantage, but I'll admit it's a small one. I guess being consistent with so many existing text formats is probably a larger advantage, so not treating NEL as a newline sounds fine to me.

rossberg · 2023-10-25T07:11:18Z

Okay, I don't care enough, so I just removed NEL. PTAL.

tlively · 2023-10-25T19:44:07Z

document/core/text/lexical.rst

-     \unicode{09} ~|~ \unicode{0A} ~|~ \unicode{0D} \\
+     \Tnewline ~|~ \unicode{09} \\
+   \production{newline} & \Tnewline &::=&
+     \unicode{0A} ~|~ \unicode{0D} ~|~ \unicode{0D}~\unicode{0A} \\


If we got rid of the last case here, then it would parse as two newlines in a row, which I don't think would break anything. Is it worth making that simplification, or is it better to keep that case for consistency with the standard definition of newline?

rossberg · 2023-10-25

Indeed, currently that should not observably change anything (though an implementation taking that literally would report incorrect line numbers). But yeah, I think it is preferable to use the standard definition for clarity.
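The line-number point can be made concrete with a small sketch, assuming a regex-based newline recogniser. The names `NEWLINE`, `NEWLINE_NO_CRLF`, and `line_of` are hypothetical and not taken from the spec or the interpreter.

```python
import re

# The production from the diff: LF, CR, or CRLF. The CRLF alternative comes
# first so the pair is matched as one newline, not as CR followed by LF.
NEWLINE = re.compile(r"\r\n|\n|\r")

# The simplified variant under discussion, with the CRLF case dropped.
NEWLINE_NO_CRLF = re.compile(r"\n|\r")


def line_of(src: str, offset: int, newline=NEWLINE) -> int:
    """Return the 1-based line number of the character at `offset`."""
    return 1 + sum(1 for m in newline.finditer(src) if m.end() <= offset)


src = "(module\r\n  (func)\r\n)"
close = src.rindex(")")  # offset of the final closing parenthesis

print(line_of(src, close))                   # with the CRLF case
print(line_of(src, close, NEWLINE_NO_CRLF))  # without it
```

With the CRLF case, the closing parenthesis is reported on line 3. Without it, each CRLF counts as two newlines and the same position is reported on line 5, although the set of accepted inputs is the same.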

tlively · 2023-10-25T19:48:42Z

test/core/comments.wast

Interesting and unfortunate that git and GitHub treat this as a binary file.

rossberg · 2023-10-25

I assume their heuristic is based on the appearance of 0x00 and other control characters.
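A heuristic of the kind guessed at here might look like the following sketch. It assumes a check for a NUL byte within a leading window of the file; the function name `looks_binary` and the window size are illustrative assumptions, and the thread does not confirm what git or GitHub actually do.

```python
def looks_binary(data: bytes, window: int = 8000) -> bool:
    """Guess whether `data` is binary by looking for a NUL byte near the start.

    Only the first `window` bytes are inspected (the size is an assumption).
    """
    return b"\x00" in data[:window]


# A bare CR alone does not trip this particular check...
print(looks_binary(b";; comment\r(instr)\n"))

# ...but a NUL byte anywhere in the window does.
print(looks_binary(b";; comment\x00\n(instr)\n"))
```

A check that also counted other control characters, as the comment suggests, would need an explicit list of which bytes count as suspicious.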

rossberg added 5 commits September 25, 2023 15:13

[spec/interpreter/test] Align definition of newline with Unicode reco…

090425e

…mmendation

[test] Make runner robust wrt races

3f3b743

Merge branch 'main' into newline

0aa2da0

[test] Fix typo in runner

0196d82

Merge branch 'main' into newline

86c5425

rossberg added 2 commits October 25, 2023 09:06

Drop recognition of NEL

e2dc367

Missed one

52dc27d

tlively approved these changes Oct 25, 2023

rossberg merged commit 43d405f into main Oct 25, 2023
5 checks passed

rossberg deleted the newline branch October 25, 2023 19:55

keithw mentioned this pull request Apr 29, 2024

Wast2Json fails on the testsuite WebAssembly/wabt#2410

Closed
