Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Unicode strings and characters #13

Open
wants to merge 15 commits into
base: master
Choose a base branch
from
Prev Previous commit
Next Next commit
Apply suggestions from code review
Co-authored-by: Rustan Leino <[email protected]>
Co-authored-by: Clément Pit-Claudel <[email protected]>
  • Loading branch information
3 people committed Aug 2, 2022
commit d347572cb6780909cb17365a4088f0e30c3d857e
8 changes: 4 additions & 4 deletions 0012-unicode-strings.md
Original file line number Diff line number Diff line change
Expand Up @@ -94,15 +94,15 @@ var s4 := @"Unicode is just so 😎"; // Escape sequences are not supported in
The exact representation of strings at runtime, including the particular encoding,
is an implementation detail of a particular backend, and will often be optimized for the idioms and support
of the target environment. Enabling Unicode characters will change the target language types used to
represent characters and strings, and hence may cause compilation errors when using additional external
represent characters and strings, and hence may be a breaking change when using additional external
target language code.

Note also that although the Unicode scalar value concept is more general than UTF-16 code units,
it still does not always correspond to what humans will perceive as single atomic characters when rendered.
For example, the string `"e\u0301"` contains two Unicode scalar values, but renders as the single character `é`.
See the concept of grapheme clusters [here](https://unicode.org/reports/tr29/) for more details.
The proposed change to the `char` type is only intended to allow the Dafny language to safely abstract
away from encodings, especially to support verifiably-correct code that must compile to multiple target languages.
away from encodings, especially to support correct code that must compile to multiple target languages.
Providing more of the concepts defined by the Unicode standard is left out of scope for this proposal,
under the assumption that it will enable such implementations via Dafny source code in sharable libraries instead.

Expand Down Expand Up @@ -144,7 +144,7 @@ e.g. between `Dafny.ISequence<char>` and `System.String` in C#.
Even there, the various Dafny runtime libraries provide helper methods that can be adjusted
to work with the new semantics of Unicode characters.
In the C# runtime, for example, the `Sequence.FromString(string)` method converts a native string
to a equivalent `Dafny.ISequence<char>` copy.
to an equivalent `Dafny.ISequence<char>` copy.
A parallel method named something similar to `Sequence.UnicodeFromString(string)` could be added
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should document and officially support the way to get to these conversion routines from Dafny. More precisely, we should give the :extern declarations one needs to write in the Dafny source to get to these routines. For C#, that may be:

type {:extern} NativeString
function {:extern "Dafny.ConvertStringToNative"} StringToNative(s: string): NativeString
function {:extern "Dafny.ConvertNativeToString"} NativeToString(s': NativeString): string

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These functions are not total: not all C# or Javascript strings will be representable in /unicodeChar:1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would specifically recommend against having any NativeString type, because it won't behave consistently across runtimes. I'm recommending instead having these adaptors defined in each runtime independently, so they can either trust the underlying native string semantics or check for and reject invalid values as appropriate. I'll expand the detail on those adaptors and make this clearer.

that would return a `Dafny.ISequence<uint32>` copy instead.
Migrating an existing codebase should reduce to a simple find-and-replace operation.
Expand Down Expand Up @@ -236,7 +236,7 @@ that is at least adequate to replace all existing usage of sequence operations.
These interfaces often become very large to support all common string operations efficiently,
and would need very careful thought.
Although unusual, representing strings directly as sequences of abstract characters
works very well for a verification-focussed language like Dafny,
works very well for a verification-focused language like Dafny,
since sequences are already a deep built-in concept in the language semantics.

This new type could alternatively be introduced with a different name, such as `unicode` as in Python 2,
Expand Down