RFC: Unicode strings and characters #13

Open · robin-aws wants to merge 15 commits into master

Addressing remaining comments
robin-aws committed Sep 26, 2022
commit 30c7ea7dc750dbe0d85c4f961699c7c922f97650
67 changes: 53 additions & 14 deletions 0012-unicode-strings.md
@@ -99,7 +99,7 @@ Note that a related but largely orthogonal issue [has been fixed
Dafny source code files must be encoded in UTF-8, but in previous versions of Dafny,
string and character literals could only contain printable and white-space ASCII characters,
due to a limitation of the Coco/R parser generator the toolchain uses.
Although this is now fixed, the scanner still downcasts Unicode code points to the 16-bit C# `char` type
and hence truncates many character values.
This will be addressed as well such that values outside the Basic Multilingual Plane
can also be used directly in string literals:
@@ -139,7 +139,6 @@ function char#IsUnicodeScalarValue(n: int): bool {
As described above in the user-facing description of this feature, this will cause verification failures by design.
Unlike many other programming languages that have added Unicode support over a major version bump,
Dafny is unusually well-positioned to catch most regressions statically.
Given this, it does not seem worth the effort to build any additional migration features to help customers adopt this change.

## Compilation/Runtime

@@ -164,7 +163,15 @@ A parallel method named something similar to `Sequence.UnicodeFromString(string)
that would return a `Dafny.ISequence<uint32>` copy instead.
Migrating an existing codebase should reduce to a simple find-and-replace operation.

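As a rough illustration of the kind of conversion involved (the helper names below are hypothetical, not the actual Dafny runtime API), a Java backend could move between a native `java.lang.String` and an array of Unicode scalar values as follows:

```java
import java.util.Arrays;

// Hypothetical helper names for illustration only; not the actual Dafny runtime API.
public final class UnicodeConversions {

    // Decode a native Java string (UTF-16 code units) into Unicode scalar values.
    public static int[] toScalarValues(String s) {
        // codePoints() combines surrogate pairs into single code points.
        return s.codePoints().toArray();
    }

    // Re-encode Unicode scalar values back into a native Java string.
    public static String fromScalarValues(int[] scalars) {
        StringBuilder sb = new StringBuilder(scalars.length);
        for (int cp : scalars) {
            sb.appendCodePoint(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "e" + combining acute accent + U+1F389 (party popper, outside the BMP)
        String s = "e\u0301\uD83C\uDF89";
        int[] scalars = toScalarValues(s);
        System.out.println(Arrays.toString(scalars));            // [101, 769, 127881]
        System.out.println(fromScalarValues(scalars).equals(s)); // true
    }
}
```

A method like the `Sequence.UnicodeFromString(string)` mentioned above would presumably perform a similar decoding step while wrapping the result in the runtime's sequence type.
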
Note that this aspect of the foreign function interface in Dafny is best handled on a per-backend basis.
It isn't generally tractable to define a single `NativeString` type with consistent semantics,
or even a consistent API, across all supported backends.

> **@keyboardDrummer (Member) commented on Sep 29, 2022:**
>
> Could you specify in more detail what the limitations of a `NativeString` type would be?
>
> I imagine it would be a subset of a collection type with linear-time access, like
>
> `type NativeString = x: LinkedListTrait<char> | ValidInAllRuntimes(x)`
>
> The `LinkedList` would warn users against doing random access on individual characters, since that may take linear time depending on the runtime.
>
> There would be some native methods on `NativeString` like `Concat(right: NativeString): NativeString` (linear in total length) and `Take(n: nat): NativeString` (linear in `n`). Something like `SubString(start: nat, length: nat)` might not exist because that would be linear not just in `length` but also in `start`, which is surprising.

There are even substantial differences across languages that both encode strings in UTF-8, for example:
a Rust string is guaranteed to hold valid UTF-8 data, whereas a Go string is not,
and hence one conversion function may fail where another cannot.

The above conversion functions currently copy data,
but each runtime should instead make it possible to wrap an idiomatic representation of a Unicode string
as the runtime representation of the Dafny `seq<char>` type, without having to make an additional copy.
At the time of writing this, only the Java runtime supports this feature,
via a `StringDafnySequence` class that wraps a `java.lang.String` value and implements the `ISequence<Character>` interface.
@@ -174,16 +181,27 @@ The one wrinkle with this feature is that methods such as `select(i)`, which imp
currently take constant time on a `StringDafnySequence`, since a `java.lang.String` also stores its contents as UTF-16
code units. However, since `UnicodeStringDafnySequence.select(i)` needs to produce the i'th Unicode scalar value instead,
this will require decoding all characters up to that index, which will take linear time in general instead.

There are at least two possible approaches a given backend can take (the first is sketched in Java after this list):

1. Cache the most recently calculated index into the string value.
On the next indexing operation, decode forwards or backwards from this position
to the requested index.
This would allow a series of indexing operations in non-decreasing (or non-increasing) order
to take amortized linear time, and only perform poorly for truly random access.

2. Lazily decode strings into their UTF-32 encoded equivalent (i.e. sequence of scalar values)
on the first operation that requires it. This mirrors the optimization of evaluating concatenation of sequences
lazily in the current C# and Java runtimes, and means that in most code that manipulates individual characters,
the amortized runtime of indexing operations will still be constant.
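
For example, here is a minimal Java sketch of the first approach, using a hypothetical wrapper class over a `java.lang.String`; it only illustrates the caching idea, not the actual runtime implementation:

```java
// Minimal sketch of approach 1 (cached index), assuming a hypothetical wrapper class;
// it illustrates the idea only and is not the actual Dafny runtime implementation.
final class CachedIndexString {
    private final String utf16;     // underlying UTF-16 representation
    private int cachedScalarIndex;  // scalar-value index of the cached position
    private int cachedCharIndex;    // corresponding offset in UTF-16 code units

    CachedIndexString(String utf16) {
        this.utf16 = utf16;
    }

    // Returns the i'th Unicode scalar value, assuming 0 <= i < number of scalar values.
    // Decodes forwards or backwards from the most recently used position,
    // so sequential access patterns only pay for the distance moved.
    int select(int i) {
        while (cachedScalarIndex < i) {
            cachedCharIndex += Character.charCount(utf16.codePointAt(cachedCharIndex));
            cachedScalarIndex++;
        }
        while (cachedScalarIndex > i) {
            cachedCharIndex -= Character.charCount(utf16.codePointBefore(cachedCharIndex));
            cachedScalarIndex--;
        }
        return utf16.codePointAt(cachedCharIndex);
    }
}
```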

Note that either optimization will also help offset the fact that materializing and manipulating individual `char` values
will in general require twice as much memory after this change.
The compiler can leverage this support to initially store string literal values in UTF-8 or UTF-16,
and operations such as concatenation or equality comparisons can be implemented
without having to decode these bytes into character values.
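
As a sketch of what that could look like (a hypothetical class, not actual compiler output), a backend that keeps a literal in UTF-8 could implement equality and concatenation directly on the encoded bytes:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch only: a string literal kept in its UTF-8 encoded form.
public final class Utf8Literal {
    final byte[] utf8;

    Utf8Literal(String literal) {
        this.utf8 = literal.getBytes(StandardCharsets.UTF_8);
    }

    private Utf8Literal(byte[] utf8) {
        this.utf8 = utf8;
    }

    // Equality of two valid UTF-8 encodings is just byte-array equality;
    // no decoding into scalar values is needed.
    boolean sameAs(Utf8Literal other) {
        return Arrays.equals(this.utf8, other.utf8);
    }

    // Concatenation is also a pure byte-level operation.
    Utf8Literal concat(Utf8Literal other) {
        byte[] result = Arrays.copyOf(this.utf8, this.utf8.length + other.utf8.length);
        System.arraycopy(other.utf8, 0, result, this.utf8.length, other.utf8.length);
        return new Utf8Literal(result);
    }
}
```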

> **Member commented:**
>
> The proposal does not address distinguishing `seq<char>` and `string` at the language level. Do we want to keep `string` an alias of `seq<char>`? Will that not make our lives harder at the compiler level? (I'm worried about e.g. polymorphic functions over `seq<T>`.)

> **Member commented:**
>
> That being said, if we change `seq<T>` from a type to a trait, it would be nice to make `string` implement `seq<char>`.

> **Member Author (robin-aws) commented:**
>
> Split this off as a separate discussion: dafny-lang/dafny#2582

> **Member Author (robin-aws) commented:**
>
> This is one point I didn't burden the proposal with, but I'll say two things:

> **Member commented:**
>
> > I'd say that being able to define seq in Dafny … will provide much of the same benefit as literally making seq a trait directly.
>
> Yep, 100% this

We can decide on which optimization is better based on profiling typical applications,
and possibly even provide both as options.

# Drawbacks
[drawbacks]: #drawbacks
@@ -245,6 +263,13 @@ Dafny could instead make `string` an alias for a different type, whether built-i
that treats strings as a more specialized datatype. It could then be more prescriptive about picking a definitive encoding to
support directly (most likely UTF-8).

The primary advantage of this would be avoiding the potential trap of expensive random access
of individual character values at runtime.
Such a `string` type would instead only provide access to individual characters
through some kind of enumeration interface
similar to those proposed in this [standard library PR](https://github.com/dafny-lang/libraries/pull/37).
It would still use a ghost `seq<char>` value to specify the underlying data.

This would certainly require more effort for existing code to migrate to.
It is also a much more expensive solution to implement, since it requires creating an API for strings
that is at least adequate to replace all existing usage of sequence operations.
@@ -253,6 +278,8 @@ and would need very careful thought.
Although unusual, representing strings directly as sequences of abstract characters
works very well for a verification-focused language like Dafny,
since sequences are already a deep built-in concept in the language semantics.
Note that sequences already do not guarantee constant time for a single `s[i]` expression,
given the lazy evaluation of concatenation.
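
To illustrate why (a simplified sketch, not the actual C# or Java runtime code), a lazily evaluated concatenation can be represented as a node that only flattens into a single array on the first element access:

```java
// Simplified sketch of lazy sequence concatenation; the real Dafny runtimes are more
// elaborate, but the consequence for s[i] is the same: the first access after a chain
// of concatenations pays a linear flattening cost.
final class LazyConcatSeq<T> {
    private LazyConcatSeq<T> left, right; // non-null until flattened
    private Object[] elements;            // non-null once flattened

    LazyConcatSeq(Object[] elements) {
        this.elements = elements;
    }

    LazyConcatSeq(LazyConcatSeq<T> left, LazyConcatSeq<T> right) {
        this.left = left;
        this.right = right;
    }

    @SuppressWarnings("unchecked")
    T select(int i) {
        return (T) flatten()[i];
    }

    private Object[] flatten() {
        if (elements == null) {
            // Flatten on first access: linear in the total length.
            Object[] l = left.flatten();
            Object[] r = right.flatten();
            elements = new Object[l.length + r.length];
            System.arraycopy(l, 0, elements, 0, l.length);
            System.arraycopy(r, 0, elements, l.length, r.length);
            left = null;
            right = null;
        }
        return elements;
    }
}
```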

This new type could alternatively be introduced with a different name, such as `unicode` as in Python 2,
while keeping the alias of `string` for `seq<char>`.
@@ -384,11 +411,23 @@ that present the contents as sequences of UTF-8, UTF-16, and UTF-32 code units r
# Unresolved questions
[unresolved-questions]: #unresolved-questions

Is there anything more we can do to make migration easier and safer for users?
Chances are very good that all Dafny code in existence to date either will not change behavior
across this change, or will slightly improve because of the improved handling of surrogate code points.
I have been unable to think of anything that would provide more value than the verifier will already provide,
but I am open to suggestions as always!

* How should a function such as `UTF8BytesAsString(utf8Bytes: seq<uint8>): string`
be implemented? It can absolutely be implemented in pure Dafny
as per the logic under https://github.com/dafny-lang/libraries/blob/master/src/Unicode,
but the result will be an unoptimized sequence of 32-bit values.
This could be implemented in the per-language Dafny runtimes instead,
and hence take advantage of the native string wrappers described above (a rough Java sketch follows this list).
Ideally there would be some way to have a single verified Dafny implementation of this
conversion function that would still maintain the memory efficiency of UTF-8.
* One use case this proposal does not cover is including a Dafny type
that will compile to a native type used for UTF-16 code units,
such as `char` in Java or C#, or `char16_t` in C++.
User Dafny code can come close by defining a `newtype` with the appropriate bounds,
but this will compile to a compatible integral native type
such as `uint16` instead.
Having to cast between this type and the native character type
does not currently seem to be enough of a usability or efficiency concern
to worry about.
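
As a rough illustration of the runtime-level option mentioned in the first question above (a hypothetical helper, not the proposed API), a Java runtime could lean on the platform's own UTF-8 decoder. Note that this decoder replaces invalid byte sequences with U+FFFD, whereas a verified Dafny implementation could instead reject them explicitly:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of implementing the conversion natively in a runtime
// rather than in compiled Dafny code; not the proposed API.
public final class Utf8Decode {
    public static int[] utf8BytesAsScalarValues(byte[] utf8Bytes) {
        // Note: malformed input is silently replaced with U+FFFD by this constructor.
        return new String(utf8Bytes, StandardCharsets.UTF_8).codePoints().toArray();
    }

    public static void main(String[] args) {
        byte[] bytes = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(utf8BytesAsScalarValues(bytes)));
        // [104, 233, 108, 108, 111]
    }
}
```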

# Future possibilities
