
RFC: Unicode strings and characters #13

Open
wants to merge 15 commits into base: master
Remaining alternatives todo
robin-aws committed Jul 13, 2022
commit 1c9ecda4fe414dd333b5fd556abaeb36ad2e42d7
13 changes: 8 additions & 5 deletions 0012-unicode-strings.md
@@ -170,11 +170,11 @@ The cheapest solution to implement would likely be to add a new requirement in t
that surrogate `char` values are only used in pairs as intended.
This would likely require more user effort to migrate their code, however,
as every operation on strings would now have to prove that a surrogate pair
is not being split up.
It also does not reduce the complexity of compiling
strings to various target languages that do not necessarily use UTF-16.
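
As a rough illustration of that burden, consider what a simple slicing operation would have to prove under such a rule (a hypothetical sketch; the predicates and method here are invented for this example and are not part of Dafny or this proposal):

```dafny
// Hypothetical helper predicates (not part of Dafny or this RFC).
predicate IsHighSurrogate(c: char) { 0xD800 <= c as int < 0xDC00 }
predicate IsLowSurrogate(c: char)  { 0xDC00 <= c as int < 0xE000 }

// Under this alternative, every slicing operation would carry an obligation
// like the second `requires`: the cut must not fall inside a surrogate pair.
method TakePrefix(s: string, i: nat) returns (t: string)
  requires i <= |s|
  requires 0 < i < |s| ==> !(IsHighSurrogate(s[i-1]) && IsLowSurrogate(s[i]))
{
  t := s[..i];
}
```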

## Define "valid UTF-16 strings" as a predicate in a shared library
## Define "valid UTF-16 strings" as a subset type in a shared library

The next cheapest solution would be to define a subset type of `string` that enforces
correct usage of surrogate characters.
@@ -185,7 +185,10 @@ This also introduces the same proof obligations as the previous solution,
although here they could be discharged inside the helper methods that operate on the subset type
in the same shared library.
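
Concretely, such a subset type might look like the following sketch (reusing the hypothetical surrogate predicates from the previous alternative; all names are illustrative, not an actual library API):

```dafny
// Hypothetical: a string is well-formed UTF-16 iff every surrogate code
// unit participates in a correctly ordered high/low surrogate pair.
predicate IsWellFormedUtf16(s: string) {
  forall i | 0 <= i < |s| ::
    (IsHighSurrogate(s[i]) ==> i + 1 < |s| && IsLowSurrogate(s[i + 1])) &&
    (IsLowSurrogate(s[i]) ==> 0 < i && IsHighSurrogate(s[i - 1]))
}

// The empty string trivially satisfies the predicate, so it serves as the witness.
type ValidUtf16String = s: string | IsWellFormedUtf16(s) witness ""
```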

This would likely be the best solution if changing the Dafny language itself were not an option.
The major downside is that built-in sequence operations such as `s[i]` and `s + s'` would still require additional
effort to achieve verification, or would have to be abandoned entirely in favour of the helper methods
provided by the shared library.
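
For example, even a plain concatenation of two values of the subset type raises a proof obligation that the verifier may not discharge on its own (hypothetical sketch, continuing the declarations above):

```dafny
method Concat(s: ValidUtf16String, t: ValidUtf16String)
  returns (r: ValidUtf16String)
{
  // Obligation: IsWellFormedUtf16(s + t). This does hold (a well-formed
  // string cannot end in an unpaired high surrogate), but the verifier may
  // need a helper lemma from the shared library to establish it.
  r := s + t;
}
```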

## Change the definition of the string type

@@ -223,10 +226,10 @@ and even when the `char` type is used it can often be only implicitly referenced
This alternative does not seem to be worth the implementation cost or additional cognitive load on users,
especially since it is hard to imagine any codebase needing to use both `char` and `rune` together.
Member:
The problem with `char` is that in Unicode, "character" refers to something that's pretty different from a scalar value (in particular, a "character" in Unicode can be made up of more than one scalar value).

Member Author:

I was trying to say I would personally find it confusing to read code that uses both `char` to represent UTF-16 code units and `rune` to represent Unicode scalar values. Code like https://github.com/dafny-lang/libraries/tree/master/src/Unicode specifically doesn't connect the built-in `char` type to anything because it generalizes over multiple encoding forms and UTF-16 isn't special in that library.

You're also absolutely right that even with `/unicodeChar:1`, `char` still doesn't correspond to an abstract Unicode character. Perhaps you're arguing that we should deprecate `char` and replace it with `rune` to more aggressively avoid that confusion? I can add that as another option here: just deprecating `char` for now wouldn't be as disruptive as some of these options and could be worth it.

Member:

I'm not sure about this (never using both `char` and `rune`): IIRC Windows file system APIs use sequences of Dafny-3 `char`s, not valid UTF-16 strings.

Maybe this just means that we'll have to build file system APIs with `seq<byte>` instead of `string`.

Member Author:

Yeah, if you need to manipulate possibly-invalid sequences of UTF-16 code units, that should probably be a `seq<uint16>` instead.

I'm personally happy with the trade-off of a simpler, maximally correct and portable definition of `string`, at the expense of making some specific APIs uglier.
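
For example (a hypothetical sketch, not an API that exists in the libraries today), such an interface could be declared as:

```dafny
// Hypothetical: raw, possibly ill-formed UTF-16 data modeled as code units.
newtype uint16 = x | 0 <= x < 0x10000

method {:extern} GetRawFileName() returns (name: seq<uint16>)
```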

Member Author:

Added an explicit note on this, and an open question on being able to compile to a native UTF-16 type.


## Disallow compilation of string indexing

Given that popular Unicode encodings such as UTF-8 and UTF-16 do not support random access of characters,
we could decide to forbid random access of string elements (e.g. `s[i]`) in compiled code,
instead directing users to fall back on external code for efficient access,
or to sequential access via the new [ordered quantification features](https://github.com/dafny-lang/rfcs/pull/10) (once they are implemented).
This would be a much more disruptive breaking change for existing code, however.
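
To make the impact concrete, hypothetical code like the following would continue to verify but would be rejected at compile time under this alternative:

```dafny
method FirstChar(s: string) returns (c: char)
  requires |s| > 0
{
  c := s[0];  // random access: accepted by the verifier, but no longer compilable
}
```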