
RFC: Unicode strings and characters #13

Open
wants to merge 15 commits into base: master

Conversation

robin-aws (Member) commented Jul 14, 2022

Resolves #12. Resolves dafny-lang/dafny#413.

Tagging those I suspect will be interested: @dafny-lang/dafny-core @mschlaipfer @txiang61 @seanmcl @indolering @alex-chew @seebees

Revision 2: Besides general clarification, I decided to drop support for the old \uXXXX UTF-16 code unit escape sequence with /unicodeChar:1, as keeping the support was creating confusion even though it was technically possible.

RustanLeino (Collaborator) left a comment

Great! Let's do it!

Also, the write-up is especially nice.

I suggest an explicit mention, somewhere, that the design omits any "byte access" into string. This is on purpose, because we want to encourage Dafny users to work at the higher level of seq<char>, not dealing with the underlying representation. In places where the bytes are of importance, it is possible to write a string-to-byte-sequence function in Dafny.
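To make that suggestion concrete, here is a minimal sketch of the kind of user-written string-to-byte-sequence function described above, assuming the proposed /unicodeChar:1 semantics where every `char` is a Unicode scalar value; the function names and the choice of `bv8` for bytes are illustrative, not part of the RFC.

```dafny
// A minimal sketch (not RFC text) of a user-level string-to-byte-sequence function,
// here producing UTF-8. Assumes /unicodeChar:1, so every char is a Unicode scalar value.
function EncodeScalarValue(c: char): seq<bv8> {
  var v := c as int;
  if v < 0x80 then
    // One byte: 0xxxxxxx
    [v as bv8]
  else if v < 0x800 then
    // Two bytes: 110xxxxx 10xxxxxx
    [(0xC0 + v / 0x40) as bv8,
     (0x80 + v % 0x40) as bv8]
  else if v < 0x10000 then
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    [(0xE0 + v / 0x1000) as bv8,
     (0x80 + (v / 0x40) % 0x40) as bv8,
     (0x80 + v % 0x40) as bv8]
  else
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    [(0xF0 + v / 0x40000) as bv8,
     (0x80 + (v / 0x1000) % 0x40) as bv8,
     (0x80 + (v / 0x40) % 0x40) as bv8,
     (0x80 + v % 0x40) as bv8]
}

function ToUTF8(s: string): seq<bv8> {
  if |s| == 0 then [] else EncodeScalarValue(s[0]) + ToUTF8(s[1..])
}
```

A decoder going the other way would likewise be ordinary Dafny code, handling ill-formed byte sequences however the application prefers.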


```dafny
method Main() {
var s := "Unicode is just so \ud83d\ude0e";
```

Collaborator

This line ought to give an error with /unicodeChar:1, since \ud83d is not a legal character.

Member Author

Turns out I'm on board with that after all. :)

string and character literals could only contain printable and white-space ASCII characters,
due to a limitation of the Coco/R parser generator the toolchain uses.
This has been fixed, and both standard form and verbatim form string literals now allow any Unicode characters.
A second form of escape sequence accepting a hexadecimal number with up to six digits, `\u{XXXXXX}`,
Collaborator

Is the "up to" just for char literals? If it were to apply also to string, then does "\ua0" mean the 1-character string containing the ASCII character 160, or does it mean the 2-character string containing the ASCII characters 16 and 48?

Oh, are the curly braces required?

Member Author

"\ua0" is currently not legal, and would remain not legal, because uXXXX escapes require exactly 4 characters, to avoid this exact ambiguity.

"\u{a0}" would mean the 1-char string containing the ASCII character 160.

(And as I type this I am leaning closer to making it "\U{a0}" instead :) )


```dafny
// Several different ways to express the same string literal
var s1 := "Unicode is just so \ud83d\ude0e";
```

Collaborator

I propose we forbid this under /unicodeChar:1. After all, neither \ud83d nor \ude0e falls into the numeric ranges of the new char.

Member Author

I didn't see a reason to reject valid string literals using the existing escape pattern, personally. I can make this clearer in the proposal, but we can keep both forms with these semantics:

  1. \uXXXX - UTF-16 code unit, incorrect use of surrogates rejected by the parser.
  2. \u{X}...\u{XXXXXX} - Unicode scalar value, surrogate range values rejected by the parser.

There is precedent for supporting both, as Go allows both \uXXXX and \UXXXXXXXX. I personally liked the variable-length syntax of \u{X..X} (also used by Swift) better, as requiring exactly eight digits just means the first two will always be 00 :)

I'm open to using U for the second form if it helps reduce confusion though.

Member Author

Note we should avoid any implication that a single escape sequence produces a single "character", since neither form can produce 🇨🇦 that way. We will only know that a single \u{X..X} sequence produces a single char (and that's not ultimately that meaningful anyway).
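For instance (an illustrative sketch, written with the \U{...} spelling that Revision 2 below settles on):

```dafny
method FlagExample() {
  // 🇨🇦 is two scalar values: REGIONAL INDICATOR SYMBOL LETTER C (U+1F1E8)
  // followed by REGIONAL INDICATOR SYMBOL LETTER A (U+1F1E6). No single escape
  // sequence can produce it; it is simply a two-char string.
  var flag := "\U{1F1E8}\U{1F1E6}";
  assert |flag| == 2;
}
```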

Member Author

Revision 2 now says we only have \U{XXXXXX} with /unicodeChar:1
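Concretely, this is how the running example from this thread would be written after Revision 2 (a sketch, not text from the RFC):

```dafny
method EscapeExample() {
  // /unicodeChar:0 (Dafny 3) form, spelled as a UTF-16 surrogate pair;
  // no longer accepted with /unicodeChar:1:
  //   var old := "Unicode is just so \ud83d\ude0e";
  // Revision 2 form: one scalar-value escape for U+1F60E:
  var s := "Unicode is just so \U{1F60E}";
  assert |s| == 20;  // 19 ASCII characters plus a single char for the emoji
}
```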

Member Author

Also I realized that in Go, \uXXXX still specifies Unicode code points rather than UTF-16 code units; that is, it rejects surrogate code points (and hence dafny-lang/dafny#1980) :)

Comment on lines 94 to 96
The exact representation of strings at runtime, including the particular encoding,
is an implementation detail of a particular backend, and will often be optimized for the idioms and support
of the target environment. Enabling Unicode characters will change the target language types used to
Collaborator

Yes!

Comment on lines 119 to 120
(0 <= n && n <= 55295 /* 0xD7FF */) ||
(57344 /* 0xE000 */ <= n && n <= 1114111 /* 0x10FFFF */ )
Collaborator

Use half-open intervals.

Member Author

The only reason I used closed intervals throughout is because that's what the Unicode standard itself does. It's always irked me, but as a result the constant 0xDFFF tends to be much more well-known than 0xE000.

I will absolutely use half-open intervals here as it will probably verify more efficiently, and I'm happy to change the proposal accordingly if it would help readability.

Member Author

Ended up going with half-open intervals for consistency.
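For reference, the constraint quoted above restated with half-open intervals (a sketch; the predicate name is not from the RFC):

```dafny
predicate IsUnicodeScalarValue(n: int) {
  // Everything below the surrogate block, plus everything above it up to U+10FFFF.
  (0 <= n < 0xD800) || (0xE000 <= n < 0x11_0000)
}
```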

representing individual characters. `char` values can be converted to and from `int` values (using `as char` and `as int` expressions),
and an `int` value corresponding to a `char` value is currently required to be a valid UTF-16 code unit, i.e. in the range
`[0, 65536)`. This range includes the so-called ["surrogate" code points](https://unicode.org/faq/utf_bom.html#utf16-2),
i.e. values in the range `[0xD800, 0xDFFF]`,
Collaborator

Continue using half-open intervals: [0xD800, 0xE000).

and are not assignable Unicode code points themselves.

I propose a breaking change in Dafny 4.0, to make `char` represent any Unicode code point, independent of the encoding used.
This means that the corresponding `int` value for a `char` must always be a [Unicode scalar value](https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf#G7404), meaning any value in the range `[0, 0x10FFFF]` but excluding the surrogate code points from `[0xD800, 0xDFFF]`.
Collaborator

Use half-open intervals, here and throughout: [0, 0x11_0000).
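A hedged sketch of what those conversions would look like under the proposed /unicodeChar:1 semantics (the method and its assertions are illustrative, not RFC text):

```dafny
method ConversionExamples() {
  var c := 0x1F60E as char;  // allowed: 0x1F60E is a Unicode scalar value
  var n := c as int;
  assert 0 <= n < 0x11_0000; // the new, wider range
  // var bad := 0xD800 as char;  // rejected: 0xD800 is a surrogate code point
}
```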

to work with the new semantics of Unicode characters.
In the C# runtime, for example, the `Sequence.FromString(string)` method converts a native string
to an equivalent `Dafny.ISequence<char>` copy.
A parallel method named something similar to `Sequence.UnicodeFromString(string)` could be added
Collaborator

We should document and officially support the way to get to these conversion routines from Dafny. More precisely, we should give the :extern declarations one needs to write in the Dafny source to get to these routines. For C#, that may be:

type {:extern} NativeString
function {:extern "Dafny.ConvertStringToNative"} StringToNative(s: string): NativeString
function {:extern "Dafny.ConvertNativeToString"} NativeToString(s': NativeString): string

Member

These functions are not total: not all C# or Javascript strings will be representable in /unicodeChar:1.

Member Author

I would specifically recommend against having any NativeString type, because it won't behave consistently across runtimes. I'm recommending instead having these adaptors defined in each runtime independently, so they can either trust the underlying native string semantics or check for and reject invalid values as appropriate. I'll expand the detail on those adaptors and make this clearer.
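To sketch what a per-runtime adaptor could look like for the C# backend (every name and signature below is hypothetical and only meant to illustrate the shape, not to propose an API):

```dafny
// Hypothetical per-backend adaptor for C#; all names are invented for illustration.
// Each runtime decides whether to trust its native strings or validate them, so the
// Dafny-facing signatures can be honest about partiality.
datatype Option<T> = None | Some(value: T)

type {:extern "System.String"} CSharpString

// Total: every Dafny string under /unicodeChar:1 has a well-formed UTF-16 encoding.
function {:extern} ToCSharpString(s: string): CSharpString

// Partial: a native string containing unpaired surrogates yields None.
function {:extern} FromCSharpString(s: CSharpString): Option<string>
```

A JavaScript or Go adaptor would make its own choice here, which is exactly the point of keeping these per backend.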

cpitclaudel (Member) left a comment

Great work. Some high-level comments / worries

I would like to see more details about the following:

  • Interop: since the new string type won't be able to represent all UTF-16 code unit sequences, it would be nice to think ahead about how we will deal with ill-formed input (I'm especially thinking of JavaScript strings coming from user inputs on a webpage, or Windows file system paths). In particular, Rustan's proposed native string conversion functions are partial. Will we have them return Result values?
  • String literals. Are we going to keep using the current UTF-16 based model? The \u codes in Dafny assume UTF-16 AFAICT.
  • Conversions to/from sequences. I'm not sure it's best to keep string as a seq<char>. In particular, it prevents us from defining member functions on the type string. We could support the as operator for casting to a seq<char> cheaply.
  • I'm not sure caching decoding to UTF-32 is efficient. We could simply cache a few (string, index) → offset pairs… even possibly as a global map (not attached to every string object).

## Correctness

The current definition of these types means that the Dafny `string` type allows data that is not a valid Unicode string.
The value `"\uD800"`, for example, is not a valid Unicode string and has no valid encoding in UTF-8, UTF-16,
Member

I think it would be worth trying to align this document's terminology with the standard Unicode terms. IIRC "Unicode string" refers to a sequence of code units (i.e. after encoding), so the first part is right (it's not a valid Unicode string), but the second part is unclear ("has no valid encoding").

Member Author

You're right, and unfortunately I can't find any standard terminology for the abstract "sequence of scalar values" concept that is ultimately what I want string to represent.

## Compatibility

The current definitions of `string` and `char` are biased towards using a UTF-16 encoding at runtime.
This aligns well with some compilation target languages which also use UTF-16, such as Java, C#, and JavaScript,
Member

But C# and Javascript don't use UTF-16, since they don't enforce well-formedness.

Member Author

Fine :) Qualified a bit here but will address this more concretely in the Runtime section.


```dafny
method Main() {
var s := "Unicode is just so \ud83d\ude0e";
```

Member

Up to this point the document hasn't addressed the representation of string literals, so I'm having trouble parsing the example. What does this string represent with unicodeChar:1?

Member Author

I pulled a fair bit of content specifically on string literals to the early parts of this section, hopefully it helps!

Member Author

Hopefully not supporting \uXXXX with /unicodeChar:1 any longer helps clear this up.


## Go:

In Go a string is a read-only slice of bytes, which generally contains UTF-8 encoded bytes
but may contain arbitrary bytes. The `rune` type is an alias for `int32`.
Member

How does the rune-based API behave when given an int outside of the range of valid scalar values?

and two new corresponding `std::u16string` and `std::u32string` collection classes.
It also provides three new kinds of string literals,
`u8"..."`, `u"..."`, and `U"..."`,
which produce binary values encoded with UTF-8, UTF-16, and UTF-32 respectively.
Member

What is a binary value?

## C++:

The `char` type represents bytes, and the `std::string` class from the standard library
operates on bytes as character, and generally does not produce correct results if used
Member

I don't understand this part:

operates on bytes as character

[unresolved-questions]: #unresolved-questions

Is there anything more we can do to make migration easier and safer for users?
Chances are very good that all Dafny code in existence to date either will not change behavior
Member

This seems overly optimistic to me WRT string literals, but I'm not sure how we plan to handle them. At the moment they are sequences of characters mixed with UTF16 code units (possibly ill-formed), right? It would be good to specify what happens to them in /unicodeChar:1

Member Author

I think this is moot now but let me know otherwise.

Member

There will be code that we won't be able to port (code using string literals with unpaired surrogates), but that's fine

Member Author

Yup, the worst-case scenario is code that uses string to carry such data, which the conversion utility (or verification with /unicodeChar:1) will reject, and which will have to be rewritten to use seq<uint16> instead. I'm definitely okay with that, especially since I think such code is much more likely to be unaware of the issue with unpaired surrogates than to be intentionally relying on the current definition of string to allow them.
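For example (a sketch; uint16 is a user-defined newtype here, not a Dafny built-in):

```dafny
newtype uint16 = x: int | 0 <= x < 0x1_0000

// An unpaired surrogate is representable as raw UTF-16 code units, even though
// "\uD800" is not a valid string under /unicodeChar:1.
const LonelySurrogate: seq<uint16> := [0xD800]
```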

Chances are very good that all Dafny code in existence to date either will not change behavior
across this change, or will slightly improve because of the improved handling of surrogate code points.
I have been unable to think of anything that would provide more value than the verifier will already provide,
but I am open to suggestion as always!
Member

Automatically converting old-style \u strings to new-style \u{} strings would be useful :)

Member Author

That's a fair point. Although I've clarified that I don't intend to drop \u (just reject invalid sequences of them), that conversion utility would still be very easy to implement.

Member Author

Now that I'm dropping \u I agree and will definitely provide a conversion utility.


This new type could alternatively be introduced with a different name, such as `unicode` as in Python 2,
while keeping the alias of `string` for `seq<char>`.
This would only increase the confusion and cognitive burden for Dafny users in the future, though.
Member

Can we discuss how easy it will be for external code to wrap a native string type to give it Dafny's string API? It would be nice if one could implement ISequence<char> using C# strings (plus some sanity checks), to save on conversions.

Member Author

I'm going to rewrite the Compilation/Runtime section to be more clear about this (as I intended that to address exactly this point, but I don't think it's clear enough yet)

Member Author

I THINK this should have enough detail but let me know otherwise.

cpitclaudel (Member) commented Jul 25, 2022

As one last high-level comment: The debate on whether string should be seq<char> or its own type would be moot if seq<char> was a trait instead of a type. Then string could be a separate type that implements seq<char>. I wonder how easy that change would be with @RustanLeino's new type system (part of the issue is knowing how to deal with empty sequences).

robin-aws self-assigned this Jul 25, 2022
effort to achieve verification, or would have to be abandoned entirely in favour of the helper methods
provided by the shared library.

## Change the definition of the `string` type
keyboardDrummer (Member) commented Jul 26, 2022

The section describes downsides of having a standalone string type, but what are the upsides?

I think one aspect to consider is managing expectations. If you have type string = seq<char>, then what performance expectations will the user have of accessing individual characters of that string, and do those align with what we're providing?

With a custom string type, you can define a fresh performance contract, such as that accessing a specific character may take linear time. With that, you would be free to encode the Dafny type nativestring using native strings. I don't think that's the right thing to do, but it seems like a benefit of this approach that isn't mentioned. Or you could have a type nativestring that doesn't have an API for accessing individual characters; that way, users who only process strings in a simple fashion can keep using native strings instead of Dafny strings, which are memory-heavy and have to be converted when calling external code.

cpitclaudel (Member)

Expanding on the comment about Windows paths and JavaScript strings:

This is relevant to the discussion of adding arguments to main, too: we'll have to consider whether main takes strings or OS strings.

but in previous versions of Dafny,
string and character literals could only contain printable and white-space ASCII characters,
due to a limitation of the Coco/R parser generator the toolchain uses.
This has been fixed, and both standard form and verbatim form string literals now allow any Unicode characters.


What has been fixed? The Coco/R parser?

Member Author

This section is intended to describe the desired end state for Dafny users, so it's describing a future state. :)

I'm saying I can see where the issue is (specifically in our copy of Scanner.frame, which is creating a Buffer when it should be creating a UTF8Buffer) so I'm proposing we fix that.

Member Author

FYI this part has been actually fixed now: dafny-lang/dafny#2717 :)

robin-aws changed the title from "Unicode strings and characters" to "RFC: Unicode strings and characters" on Jul 26, 2022
The exact representation of strings at runtime, including the particular encoding,
is an implementation detail of a particular backend, and will often be optimized for the idioms and support
of the target environment.
This also applies to string literals in Dafny source code:
Member

This is better but still ambiguous. Maybe it's enough to say that string literals are as in Dafny 3, except for the fact that unpaired surrogates are disallowed?

Member Author

That's only true w.r.t. what literals are permitted in source text though (with the added \U{X..X} escape sequence), not how they are represented at runtime. I do want to say that, unlike in Dafny 3, users shouldn't assume a string literal becomes a sequence where every ASCII character or \u escape sequence literally becomes an individual element of that sequence. I want to call out that if you compile your program to Go, for example, your literal will likely become a Go string value wrapped as a Dafny.Sequence<rune>.

Member Author

Again I'm hoping that dropping support for \u makes this clearer.

Migrating an existing codebase should reduce to a simple find-and-replace operation.

Note that this aspect of the foreign function interface in Dafny is best handled on a per-backend basis.
It isn't generally tractable to define a single `NativeString` type with consistent semantics,
keyboardDrummer (Member) commented Sep 29, 2022

Could you specify in more detail what the limitations of a NativeString type would be?

I imagine it would be a subset of a collection type with linear time access, like

type NativeString = x: LinkedListTrait<char> | ValidInAllRuntimes(x)

The LinkedList would warn users against doing random access on individual characters, since that may take linear time depending on the runtime.

There would be some native methods on NativeString like Concat(right: NativeString): NativeString (linear in total length) and Take(n: nat): NativeString (linear in n). Something like SubString(start: nat, length: nat) might not exist because that would be linear not just in length but also in start, which is surprising.

robin-aws added a commit to dafny-lang/dafny that referenced this pull request Nov 24, 2022
Implementation of the design from
dafny-lang/rfcs#13.

Resolves #413. Fixes #2928. Fixes #818. Fixes #3058. Fixes #1293. Fixes #3001.

Depends on #2976 to be fixed for the tests to pass consistently across
platforms.

I've documented some of the less obvious compilation strategy decisions
in `docs/Compilation/StringsAndChars.md`.

By submitting this pull request, I confirm that my contribution is made under the terms of the [MIT license](https://github.com/dafny-lang/dafny/blob/master/LICENSE.txt).
Labels: breaking-change (any change that will cause existing Dafny codebases to break, excluding verification instability)

Successfully merging this pull request may close these issues: "RFC pending: Unicode string support" and "Revisit definition of Dafny strings (RFC #12)".