RFC: Unicode strings and characters #13

Open

wants to merge 15 commits into base: master

Trailing fix ups
robin-aws committed Jul 14, 2022
commit 5f7d0e0343df943f04fe6b47ff1fcd2c1ea61fda
0012-unicode-strings.md (48 changes: 29 additions & 19 deletions)
@@ -8,11 +8,14 @@

The Dafny `string` type is an alias for the type `seq<char>`, and the `char` type is an opaque built-in type
representing individual characters. `char` values can be converted to and from `int` values (using `as char` and `as int` expressions),
and an `int` value corresponding to a `char` value is currently required to be a valid UTF-16 code point, i.e. in the range
`[0, 65536)`. This range includes the so-called "surrogate" characters, i.e. code points in the range `[0xD800, 0xDFFF]`, which must be used in pairs in order to encode some characters in UTF-16, and are not assignable Unicode code points themselves.
and an `int` value corresponding to a `char` value is currently required to be a valid UTF-16 code unit, i.e. in the range
`[0, 65536)`. This range includes the so-called ["surrogate" code points](https://unicode.org/faq/utf_bom.html#utf16-2),
i.e. values in the range `[0xD800, 0xDFFF]`,
Collaborator: Continue using half-open intervals: [0xD800, 0xE000).

which must be used in pairs in order to encode some characters in UTF-16,
and are not assignable Unicode code points themselves.
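
For illustration, here is a minimal sketch of the current semantics (the method and variable names are mine, not from the RFC):

```dafny
method CurrentCharSemantics() {
  var n := 'a' as int;         // a char converts to its code point: n == 97
  assert n == 97;
  var lone := 0xD800 as char;  // currently legal: an unpaired high surrogate
  var s: string := [lone];     // ...so a string value can be invalid UTF-16
}
```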

I propose a breaking change in Dafny 4.0, to make `char` represent any Unicode character, independent of the encoding used.
This means that the corresponding `int` value for a `char` must always be a [Unicode scalar value](https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf#G7404), meaning any value in the range `[0, 0x10FFFF]` but excluding the surrogate characters from `[0xD800, 0xDFFF]`.
I propose a breaking change in Dafny 4.0, to make `char` represent any Unicode code point, independent of the encoding used.
This means that the corresponding `int` value for a `char` must always be a [Unicode scalar value](https://www.unicode.org/versions/Unicode14.0.0/ch03.pdf#G7404), meaning any value in the range `[0, 0x10FFFF]` but excluding the surrogate code points from `[0xD800, 0xDFFF]`.
Collaborator: Use half-open intervals, here and throughout: [0, 0x11_0000).
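
A matching sketch of the proposed semantics (again, the names are illustrative; the conversion is now checked against the scalar-value range):

```dafny
method ProposedCharSemantics() {
  var emoji := 0x1F60E as char;  // now legal: a scalar value beyond the BMP
  assert 0 <= emoji as int < 0x11_0000;
  // The following would now be rejected, since surrogates are not scalar values:
  // var lone := 0xD800 as char;
}
```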


# Motivation
[motivation]: #motivation
@@ -22,7 +25,7 @@ The two primary motivations behind this change are correctness and compatibility
## Correctness

The current definition of these types means that the Dafny `string` type allows data that is not a valid Unicode string.
The value `[0xD800 as char]`, for example, is not a valid Unicode string and has no valid encoding in UTF-8, UTF-16,
The value `"\uD800"`, for example, is not a valid Unicode string and has no valid encoding in UTF-8, UTF-16,
Member: I think it would be worth trying to align this document's terminology with the standard Unicode terms. IIRC "Unicode string" refers to a sequence of code units (i.e. after encoding), so the first part is valid (it's not a valid Unicode string), but the second part is unclear ("has no valid encoding").

Member Author: You're right, and unfortunately I can't find any standard terminology for the abstract "sequence of scalar values" concept that is ultimately what I want `string` to represent.

or any other encoding.
This means that any logic to process strings must manually specify pre-conditions to exclude invalid values, or
be less precise with its specification. For example, an implementation of UTF-8 encoding would have to reject
@@ -33,7 +36,7 @@ discarding or replacing invalid characters.
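
As a sketch of the kind of precondition this forces today (the predicate is illustrative, not taken from any existing library):

```dafny
// Rules out strings with unpaired or out-of-order surrogates.
predicate WellFormedUtf16(s: seq<char>) {
  forall i | 0 <= i < |s| ::
    (0xD800 <= s[i] as int < 0xDC00 ==>
      i + 1 < |s| && 0xDC00 <= s[i + 1] as int < 0xE000)
    && (0xDC00 <= s[i] as int < 0xE000 ==>
      i > 0 && 0xD800 <= s[i - 1] as int < 0xDC00)
}
```

A UTF-8 encoder could then declare `requires WellFormedUtf16(s)` instead of handling invalid input at runtime.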

The current definitions of `string` and `char` are biased towards using a UTF-16 encoding at runtime.
This aligns well with some compilation target languages which also use UTF-16, such as Java, C#, and JavaScript,
Member: But C# and JavaScript don't use UTF-16, since they don't enforce well-formedness.

Member Author: Fine :) Qualified a bit here, but I will address this more concretely in the Runtime section.

but less well with those that use the more popular UTF-8 encoding, such as Go or Rust.
but less well with those that use the recently more popular UTF-8 encoding, such as Go or Rust.
Any Dafny code that interfaces with external code will often have to convert between `string` values and
native representations of strings, and baking in the assumption of UTF-16 imposes a complexity and performance
penalty for multiple target environments.
@@ -78,14 +81,14 @@ This has been fixed, and both standard form and verbatim form string literals no
A second form of escape sequence accepting a hexadecimal number with up to six digits, `\u{XXXXXX}`,
Collaborator: Is the "up to" just for char literals? If it were to apply also to string, then does "\ua0" mean the 1-character string containing the ASCII character 160, or does it mean the 2-character string containing the ASCII characters 16 and 48?

Oh, are the curly braces required?

Member Author: "\ua0" is currently not legal, and would remain not legal, because \uXXXX escapes require exactly 4 hex digits, to avoid this exact ambiguity.

"\u{a0}" would mean the 1-char string containing the ASCII character 160.

(And as I type this I am leaning closer to making it "\U{a0}" instead :) )

is now provided to support characters outside of the Basic Multilingual Plane
using their direct Unicode code points instead of using surrogates.
This change is fully backwards-compatible and not controlled by the `unicodeChar` flag.
These changes are fully backwards-compatible and not controlled by the `unicodeChar` flag.

```dafny
// Several different ways to express the same string literal
var s1 := "Unicode is just so \ud83d\ude0e";
```

Collaborator: I propose we forbid this under /unicodeChar:1. After all, neither \ud83d nor \ude0e falls into the numeric ranges of the new char.

Member Author: I didn't see a reason to reject valid string literals using the existing escape pattern, personally. I can make this clearer in the proposal, but we can keep both forms with these semantics:

  1. \uXXXX - UTF-16 code unit, incorrect use of surrogates rejected by the parser.
  2. \u{X}...\u{XXXXXX} - Unicode scalar value, surrogate range values rejected by the parser.

There is precedent for supporting both, as Go allows both \uXXXX and \UXXXXXXXX. I liked the variable-length syntax of \u{X..X} (also used by Swift) better personally, as requiring exactly eight digits just means the first two will always be 00 :)

I'm open to using U for the second form if it helps reduce confusion, though.

Member Author: Note we should avoid any implication that a single escape sequence produces a single "character", since neither form can produce 🇨🇦 that way. We will only know that a single \u{X..X} sequence will produce a single char (and that's not ultimately that meaningful anyway).

Member Author: Revision 2 now says we only have \U{XXXXXX} with /unicodeChar:1.

Member Author: Also, I realized that in Go \uXXXX still specifies Unicode code points rather than UTF-16 code units; that is, it rejects surrogate code points (and hence dafny-lang/dafny#1980 :)

```dafny
var s2 := "Unicode is just so \u{1F60E}";
var s3 := "Unicode is just so 😎";
var s4 := @"Unicode is just so 😎"; // Escape sequences not supported in verbatim strings
var s4 := @"Unicode is just so 😎"; // Escape sequences are not supported in verbatim strings
```

The exact representation of strings at runtime, including the particular encoding,
@@ -95,8 +98,9 @@ represent characters and strings, and hence may cause compilation errors when us
target language code.

Note also that although the Unicode scalar value concept is more general than UTF-16 code units,
it still does not always correspond to what humans will perceive as single atomic characters when rendered;
see the concept of grapheme clusters [here](https://unicode.org/reports/tr29/) for more details.
it still does not always correspond to what humans will perceive as single atomic characters when rendered.
For example, the string `"e\u0301"` contains two Unicode scalar values, but renders as the single character `é`.
See the concept of grapheme clusters [here](https://unicode.org/reports/tr29/) for more details.
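
As a concrete illustration (a sketch; the assertion holds because `|s|` counts `char` values, not rendered glyphs):

```dafny
method GraphemeExample() {
  var s := "e\u0301"; // 'e' followed by U+0301 COMBINING ACUTE ACCENT
  assert |s| == 2;    // two scalar values, rendered as one glyph
}
```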
The proposed change to the `char` type is only intended to allow the Dafny language to safely abstract
away from encodings, especially to support verifiably-correct code that must compile to multiple target languages.
robin-aws marked this conversation as resolved.
Providing more of the concepts defined by the Unicode standard is left out of scope for this proposal,
@@ -204,7 +208,7 @@ strings to various target languages that do not necessarily use UTF-16.

The next cheapest solution would be to define a subset type of `string` that enforces
correct usage of surrogate characters.
Similar such definitions already exist for valid [UTF-8](https://github.com/aws/aws-encryption-sdk-dafny/blob/mainline/src/Util/UTF8.dfy#L21).
Similar such definitions already exist for valid [UTF-8](https://github.com/aws/aws-encryption-sdk-dafny/blob/mainline/src/Util/UTF8.dfy#L21)
and [UTF-16](https://github.com/dafny-lang/libraries/blob/master/src/Unicode/Utf16EncodingForm.dfy) code unit sequences
in some Dafny codebases.
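
Reusing the hypothetical `WellFormedUtf16` predicate sketched under Motivation, such a subset type could look like this:

```dafny
// The empty string satisfies the constraint, so no explicit witness is needed.
type ValidString = s: seq<char> | WellFormedUtf16(s)
```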
This also introduces the same proof obligations as the previous solution,
@@ -235,20 +239,17 @@ Although unusual, representing strings directly as sequences of abstract charact
works very well for a verification-focussed language like Dafny,
robin-aws marked this conversation as resolved.
since sequences are already a deep built-in concept in the language semantics.

This new type could alternatively be introduced with a different name, while keeping the alias of `string` for `seq<char>`.
This new type could alternatively be introduced with a different name, such as `unicode` as in Python 2,
while keeping the alias of `string` for `seq<char>`.
This would only increase the confusion and cognitive burden for Dafny users in the future, though.
Member: Can we discuss how easy it will be for external code to wrap a native string type to give it Dafny's string API? It would be nice if one could implement ISequence<char> using C# strings (plus some sanity checks), to save on conversions.

Member Author: I'm going to rewrite the Compilation/Runtime section to be clearer about this (I intended it to address exactly this point, but I don't think it's clear enough yet).

Member Author: I THINK this should have enough detail now, but let me know otherwise.

It may be a good idea, however, for a shared Dafny source library to define a `EncodedString` wrapper
around encoded bytes that includes a ghost `string` value defining the actual abstract string encoded.
This could make the [existing implementations of UTF-8 and UTF-16 encodings](https://github.com/dafny-lang/libraries/tree/master/src/Unicode)
more efficient and pleasant to use.

## Add a new, distinct `rune` type

We could maintain the current definition of the `char` type, and introduce a new `rune` type to represent Unicode scalar values
instead ("rune" being the term both Go and C# use for this).
This would make it more obvious when reading Dafny source code in isolation whether it was using the new definitions of strings.
There are often few explicit references to individual character values in code that references string values, however,
and even when the `char` type is used it can often be only implicitly referenced because of type inference.
and even when the `char` type is used it is often only implicitly referenced because of type inference.
This alternative does not seem to be worth the implementation cost or additional cognitive load on users,
especially since it is hard to imagine any codebase needing to use both `char` and `rune` together.
Member: The problem with char is that in Unicode, "character" refers to something that's pretty different from a scalar value (in particular, a "character" in Unicode can be made up of more than one scalar value).

Member Author: I was trying to say I would personally find it confusing to read code that uses both char to represent UTF-16 code units and rune to represent Unicode scalar values. Code like https://github.com/dafny-lang/libraries/tree/master/src/Unicode specifically doesn't connect the built-in char type to anything, because it generalizes over multiple encoding forms and UTF-16 isn't special in that library.

You're also absolutely right that even with /unicodeChar:1, char still doesn't correspond to an abstract Unicode character. Perhaps you're arguing that we should deprecate char and replace it with rune to more aggressively avoid that confusion? I can add that as another option here: just deprecating char for now wouldn't be as disruptive as some of these options and could be worth it.

Member: I'm not sure about this (never using both char and rune): IIRC Windows file system APIs use sequences of Dafny-3 chars, not valid UTF-16 strings.

Maybe this just means that we'll have to build file system APIs with seq<byte> instead of string.

Member Author: Yeah, if you need to manipulate possibly-invalid sequences of UTF-16 code units, that should probably be a seq<uint16> instead.

I'm personally happy with the trade-off of a simpler, maximally correct and portable definition of string at the expense of making some specific APIs more ugly.

Member Author: Added an explicit note on this, and an open question on being able to compile to a native UTF-16 type.


@@ -278,7 +279,7 @@ the definition of how scalar values are grouped into grapheme clusters
is much more complicated and locale-specific than the simple mathematical rules of common encodings
like UTF-8 and UTF-16. This means a higher cost and risk of bugs in adding these concepts
to the core definition of Dafny, and a much higher likelihood of having to change
the language definition or implementation in future versions of Unicode.
the language definition or implementation to support future versions of Unicode.
It would also mean `char` can no longer be represented as a single integer value,
instead needing multiple scalar values in some cases,
which would be a large and deep change to the encoding of `char` in the Boogie prelude and translator.
robin-aws marked this conversation as resolved.
@@ -355,7 +356,7 @@ and unusually defines `Character` as an [*extended grapheme cluster*](https://un
which is the closest concept to an atomic character as perceived visually by users.
Unicode scalar values are represented with a distinct `Unicode.Scalar` structure instead.
`String` provides `utf8`, `utf16`, and `unicodeScalars` views
that present the contents as sequences of UTF-8, UTF-16, or UTF-32 code units respectively.
that present the contents as sequences of UTF-8, UTF-16, and UTF-32 code units respectively.

# Unresolved questions
[unresolved-questions]: #unresolved-questions
@@ -365,3 +366,12 @@ Chances are very good that all Dafny code in existence to date either will not ch
across this change, or will slightly improve because of the improved handling of surrogate code points.
I have been unable to think of anything that would provide more value than the verifier will already provide,
but I am open to suggestion as always!
Member: Automatically converting old-style \u strings to new-style \u{} strings would be useful :)

Member Author: That's a fair point, and although I've clarified that I don't intend to drop \u (just reject invalid sequences of them), that conversion utility would still be very easy to implement.

Member Author: Now that I'm dropping \u, I agree and will definitely provide a conversion utility.


# Future possibilities

## `EncodedString` type

It may be a good idea for a shared Dafny source library to define an `EncodedString` wrapper
around encoded bytes that includes a ghost `string` value defining the actual abstract string encoded.
This could make the [existing implementations of UTF-8 and UTF-16 encodings](https://github.com/dafny-lang/libraries/tree/master/src/Unicode)
more efficient and pleasant to use.
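
A minimal sketch of what such a wrapper could look like (the names are illustrative, not an existing library API; this assumes a Dafny version that allows ghost parameters in datatype constructors):

```dafny
newtype uint8 = x: int | 0 <= x < 256

// The bytes are the runtime representation; the ghost field records the
// abstract string they encode, so specifications can talk about the
// decoded content without paying to decode it at runtime.
datatype EncodedString = EncodedString(bytes: seq<uint8>, ghost contents: string)
```

A decoding method could then promise `ensures result == e.contents`, tying the concrete bytes back to the abstract `string`.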