Feat: Support for full UTF8 parsing by default. #2717

MikaelMayer · 2022-09-08T20:18:46Z

This PR enables support for parsing any UTF-8 string in a source file and in comments.

Is it a fix for #2620 ?

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

robin-aws

So glad the fix was as simple as I suspected! :)

Let's add at least one integration test too. Something like:

method Main() {
  print "Mikaël Mayer fixed UTF8 parsing!", "\n";
}

I'd like to verify whether this is enough to support arbitrary characters in string literals (I suspect it is but I'm not sure).

We also need to update the reference manual since it touches on this: https://dafny.org/dafny/DafnyRef/DafnyRef#sec-unicode.

robin-aws · 2022-09-08T21:20:05Z

Is it a fix for #2620 ?

Nope, because that issue is about invalid uses of \u escape sequences. But you may be fixing a separate issue: non-ASCII characters included directly in the source are truncated by the parser:

method Main() {
  var s := "Unicode is just so 😎";
  print s, "\n";
}

This currently prints "Unicode is just so ð" (in C# at least), because the parser only retains the first byte (0xF0) of the emoji character's UTF8 encoding and then interprets it as a UTF16 code unit.
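The truncation described above can be reproduced outside Dafny. This is a minimal Python sketch, not the parser's actual code. The helper name `truncate_to_first_byte` is hypothetical and only mimics the reported behaviour of keeping the first UTF-8 byte and treating it as a UTF-16 code unit.

```python
def truncate_to_first_byte(ch: str) -> str:
    """Mimic the reported bug (hypothetical helper, not Dafny's scanner):
    keep only the first UTF-8 byte of a character and reinterpret
    that byte value as a single UTF-16 code unit."""
    first_byte = ch.encode("utf-8")[0]
    return chr(first_byte)


emoji = "😎"
# The emoji occupies four bytes in UTF-8; the first one is 0xF0.
print([hex(b) for b in emoji.encode("utf-8")])
# Reinterpreting 0xF0 as a code unit yields U+00F0.
print(truncate_to_first_byte(emoji))
```

Under that assumption, the lone byte 0xF0 lands on U+00F0 (ð), which matches the output quoted above.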

MikaelMayer · 2022-09-13T15:46:53Z

I'm able to make it work with ë (the scanner decodes the two-byte UTF-8 sequence to the value 235, which is ë in Latin-1), but not the smiley. I tried my best for 30 minutes to change the scanner so that the smiley would pass, but 1) although it's one character, it takes two columns in an editor, and 2) it would require not casting from (int) to (char), and I haven't found the right recipe.
So, do we want smileys for now, or should I just be happy with this change?
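The (int)-to-(char) cast mentioned in point 2 can be sketched in isolation. This assumes the scanner decodes each character to an integer code point and then narrows it to a 16-bit char. The scanner code itself is not shown in this thread, so `narrow_to_char16` is a hypothetical stand-in for that cast.

```python
def narrow_to_char16(code_point: int) -> int:
    """Hypothetical stand-in for an unchecked (int) -> (char) cast:
    keep only the low 16 bits of the code point."""
    return code_point & 0xFFFF


for ch in ("ë", "😎"):
    cp = ord(ch)
    narrowed = narrow_to_char16(cp)
    # A code point survives the cast only if it fits in 16 bits.
    print(f"U+{cp:04X} -> 0x{narrowed:04X}, lossless: {narrowed == cp}")
```

ë (U+00EB) fits in 16 bits and survives the narrowing. 😎 (U+1F60E) does not, so a single 16-bit char cannot hold it.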

robin-aws · 2022-09-13T22:38:03Z

No, don't worry about smileys for now. :) I just wanted to understand the extent of the fix. If ë works, please do add a test to lock that down, but that's enough.

robin-aws · 2022-09-14T16:26:53Z

RELEASE_NOTES.md

@@ -1,5 +1,6 @@
 # Upcoming

+- feat: Support for parsing UTF8 characters in code and comments (https://github.com/dafny-lang/dafny/pull/2717)


We need to be excruciatingly precise about what is working and what still isn't, given how much confusion Unicode tends to cause users.

AFAICT UTF-8 sequences that map to a single UTF-16 code unit (such as ∈ in your unit test) definitely round-trip successfully through parsing and printing, but I'm not convinced those that need UTF-16 surrogates do (such as 😎, which maps to 0xD83D 0xDE0E in UTF-16 - note that I've been using that emoji as a canonical example because many other emojis actually map to multiple Unicode scalar values, and that's a layer of complexity we can avoid for now :).
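The difference between the two cases can be checked with a short Python sketch. It is illustrative only and independent of Dafny's implementation. The helper name `utf16_code_units` is an assumption introduced here.

```python
from typing import List


def utf16_code_units(s: str) -> List[int]:
    """Return the UTF-16 code units of a string (illustrative helper)."""
    data = s.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


for ch in ("∈", "😎"):
    units = utf16_code_units(ch)
    # One unit: fits in a single 16-bit char. Two units: a surrogate pair.
    print(ch, [hex(u) for u in units])
```

∈ encodes as the single code unit 0x2208, while 😎 needs the surrogate pair 0xD83D 0xDE0E.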

It's going to be challenging to nail that in a single sentence, so I'd suggest tackling the reference manual section edit first. We'll likely want to include a link to that here too.

Test/dafny0/PrintUTF8.dfy

robin-aws

Thanks for continuing to entertain my nit-picking.

robin-aws · 2022-09-15T23:10:01Z

Test/dafny0/PrintUTF8Fails.dfy.expect

@@ -0,0 +1,2 @@
+PrintUTF8Fails.dfy(5,8): Error: invalid LogicalExpression


Oh wow that's worse than I thought 😎

docs/DafnyRef/Grammar.md

RELEASE_NOTES.md

Co-authored-by: Robin Salkeld <[email protected]>

robin-aws

❤️ (even if we can't quite parse that yet)

davidcok · 2022-09-20T23:33:01Z

docs/DafnyRef/Grammar.md

 All program text other than the contents of comments, character, string and verbatim string literals
-consists of printable and white-space ASCII characters,
+consists of printable and white-space ASCII characters.


The change from , to . is wrong here -- it leaves a sentence fragment on the next line.

MikaelMayer · 2022-09-21

Can you fix it in your next PR please? I can review that. Thanks for reporting.

davidcok · 2022-09-30

Added to cok-docs-TODOs, PR#2817

Feat: Support for full UTF8 parsing by default.

7bcb046

This PR enables support for parsing any UTF-8 string in a source file and in comments.

MikaelMayer self-assigned this Sep 8, 2022

MikaelMayer requested a review from robin-aws September 8, 2022 20:18

Update RELEASE_NOTES.md

065c7e0

robin-aws requested changes Sep 8, 2022

MikaelMayer added 2 commits September 12, 2022 12:01

Merge branch 'master' into feat-full-utf8-parsing

a07f290

Merge remote-tracking branch 'origin/feat-full-utf8-parsing' into fea…

9279e12

…t-full-utf8-parsing

MikaelMayer added 2 commits September 14, 2022 09:17

Added a test case

6cb6d5c

Merge branch 'master' into feat-full-utf8-parsing

e3a540f

MikaelMayer enabled auto-merge (squash) September 14, 2022 14:18

MikaelMayer requested a review from robin-aws September 14, 2022 14:18

robin-aws requested changes Sep 14, 2022

MikaelMayer added 2 commits September 15, 2022 09:57

Better release notes

25d1860

Added failing tests

1e6278a

MikaelMayer requested a review from robin-aws September 15, 2022 14:57

robin-aws requested changes Sep 15, 2022

MikaelMayer and others added 4 commits September 16, 2022 14:54

Update docs/DafnyRef/Grammar.md

750f5c6

Co-authored-by: Robin Salkeld <[email protected]>

Update docs/DafnyRef/Grammar.md

dfedbfe

Co-authored-by: Robin Salkeld <[email protected]>

Update RELEASE_NOTES.md

7c04f68

Co-authored-by: Robin Salkeld <[email protected]>

Merge branch 'master' into feat-full-utf8-parsing

2825366

MikaelMayer requested a review from robin-aws September 16, 2022 19:54

MikaelMayer added 2 commits September 20, 2022 15:01

Merge branch 'master' into feat-full-utf8-parsing

3728aad

Merge branch 'master' into feat-full-utf8-parsing

4e8dfba

robin-aws approved these changes Sep 20, 2022

Merge branch 'master' into feat-full-utf8-parsing

f7cf534

MikaelMayer merged commit d0f6769 into master Sep 20, 2022

MikaelMayer deleted the feat-full-utf8-parsing branch September 20, 2022 23:17

davidcok reviewed Sep 20, 2022

robin-aws mentioned this pull request Sep 22, 2022

RFC: Unicode strings and characters dafny-lang/rfcs#13

Open

robin-aws mentioned this pull request Oct 27, 2022

Dafny does not support unicode characters (outside of escape sequences) #818

Closed
