-
Notifications
You must be signed in to change notification settings - Fork 254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: Support for full UTF8 parsing by default. #2717
Conversation
This PR enables support for parsing any UTF-8 string in a source file and in comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So glad the fix was as simple as I suspected! :)
Let's add at least one integration test too. Something like:
method Main() {
print "Mikaël Mayer fixed UTF8 parsing!", "\n";
}
I'd like to verify whether this is enough to support arbitrary characters in string literals (I suspect it is but I'm not sure).
We also need to update the reference manual since it touches on this: https://dafny.org/dafny/DafnyRef/DafnyRef#sec-unicode.
Nope, because that issue talking about invalid uses of method Main() {
var s := "Unicode is just so 😎";
print s, "\n";
} This currently prints "Unicode is just so ð" (in C# at least), because the parser only retains the first byte (0xF0) of the emoji character's UTF8 encoding and then interprets it as a UTF16 code unit. |
I'm able to make it work with ë (the utf-8 sequence compiles the value of the two byte sequence to 240, which is interpreted as the |
No don't worry about smileys for now. :) I just wanted to understand the extent of the fix. If ë works please do add a test to lock that down, but that's enough. |
RELEASE_NOTES.md
Outdated
@@ -1,5 +1,6 @@ | |||
# Upcoming | |||
|
|||
- feat: Support for parsing UTF8 characters in code and comments (https://github.com/dafny-lang/dafny/pull/2717) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to be excruciatingly precise about what is working and what still isn't, given how much confusion Unicode tends to cause users.
AFAICT UTF-8 sequences that map to a single UTF-16 code unit (such as ∈
in your unit test) definitely round-trip successfully through parsing and printing, but I'm not convinced those that need UTF-16 surrogates do (such as 😎, which maps to 0xD83D 0xDE0E
in UTF-16 - note that I've been using that emoji as a canonical example because many other emojis actually map to multiple Unicode scalar values, and that's a layer of complexity we can avoid for now :).
It's going to be challenging to nail that in a single sentence, so I'd suggest tackling the reference manual section edit first. We'll likely want to include a link to that here too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for continuing to entertain my nit-picking.
@@ -0,0 +1,2 @@ | |||
PrintUTF8Fails.dfy(5,8): Error: invalid LogicalExpression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh wow that's worse than I thought 😎
Co-authored-by: Robin Salkeld <[email protected]>
Co-authored-by: Robin Salkeld <[email protected]>
Co-authored-by: Robin Salkeld <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️ (even if we can't quite parse that yet)
All program text other than the contents of comments, character, string and verbatim string literals | ||
consists of printable and white-space ASCII characters, | ||
consists of printable and white-space ASCII characters. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The change from , to . is wrong here -- it leaves a sentence fragment on the next line.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you fix it in your next PR please? I can review that. Thanks for reporting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to cok-docs-TODOs, PR#2817
This PR enables support for parsing any UTF-8 string in a source file and in comments.
Is it a fix for #2620 ?
By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.