Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: Support for full UTF8 parsing by default. #2717

Merged
merged 15 commits into from
Sep 20, 2022

Conversation

MikaelMayer
Copy link
Member

This PR enables support for parsing any UTF-8 string in a source file and in comments.

Is it a fix for #2620 ?

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

This PR enables support for parsing any UTF-8 string in a source file and in comments.
@MikaelMayer MikaelMayer self-assigned this Sep 8, 2022
Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So glad the fix was as simple as I suspected! :)

Let's add at least one integration test too. Something like:

method Main() {
  print "Mikaël Mayer fixed UTF8 parsing!", "\n";
}

I'd like to verify whether this is enough to support arbitrary characters in string literals (I suspect it is but I'm not sure).

We also need to update the reference manual since it touches on this: https://dafny.org/dafny/DafnyRef/DafnyRef#sec-unicode.

@robin-aws
Copy link
Member

Is it a fix for #2620 ?

Nope, because that issue talking about invalid uses of \u escape sequences. But you may be fixing the separate issue that directly included non-ASCII characters are truncated by the parser:

method Main() {
  var s := "Unicode is just so 😎";
  print s, "\n";
}

This currently prints "Unicode is just so ð" (in C# at least), because the parser only retains the first byte (0xF0) of the emoji character's UTF8 encoding and then interprets it as a UTF16 code unit.

@MikaelMayer
Copy link
Member Author

I'm able to make it work with ë (the utf-8 sequence compiles the value of the two byte sequence to 240, which is interpreted as the ë in ASCII)
but not the smiley. I tried my best for 30 minutes to change the scanner so that the smiley would pass, but 1) although it's one character, it takes two columns on an editor and 2) it would require not casting from (int) to (char), and I haven't found the good recipe.
So, do we want smileys for now, or should I just be happy with this change for now?

@robin-aws
Copy link
Member

No don't worry about smileys for now. :) I just wanted to understand the extent of the fix. If ë works please do add a test to lock that down, but that's enough.

RELEASE_NOTES.md Outdated
@@ -1,5 +1,6 @@
# Upcoming

- feat: Support for parsing UTF8 characters in code and comments (https://github.com/dafny-lang/dafny/pull/2717)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be excruciatingly precise about what is working and what still isn't, given how much confusion Unicode tends to cause users.

AFAICT UTF-8 sequences that map to a single UTF-16 code unit (such as in your unit test) definitely round-trip successfully through parsing and printing, but I'm not convinced those that need UTF-16 surrogates do (such as 😎, which maps to 0xD83D 0xDE0E in UTF-16 - note that I've been using that emoji as a canonical example because many other emojis actually map to multiple Unicode scalar values, and that's a layer of complexity we can avoid for now :).

It's going to be challenging to nail that in a single sentence, so I'd suggest tackling the reference manual section edit first. We'll likely want to include a link to that here too.

Test/dafny0/PrintUTF8.dfy Show resolved Hide resolved
Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for continuing to entertain my nit-picking.

@@ -0,0 +1,2 @@
PrintUTF8Fails.dfy(5,8): Error: invalid LogicalExpression
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wow that's worse than I thought 😎

docs/DafnyRef/Grammar.md Outdated Show resolved Hide resolved
docs/DafnyRef/Grammar.md Outdated Show resolved Hide resolved
RELEASE_NOTES.md Outdated Show resolved Hide resolved
Copy link
Member

@robin-aws robin-aws left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️ (even if we can't quite parse that yet)

@MikaelMayer MikaelMayer merged commit d0f6769 into master Sep 20, 2022
@MikaelMayer MikaelMayer deleted the feat-full-utf8-parsing branch September 20, 2022 23:17
All program text other than the contents of comments, character, string and verbatim string literals
consists of printable and white-space ASCII characters,
consists of printable and white-space ASCII characters.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change from , to . is wrong here -- it leaves a sentence fragment on the next line.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you fix it in your next PR please? I can review that. Thanks for reporting.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to cok-docs-TODOs, PR#2817

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants