-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RFC: Unicode strings and characters #13
base: master
Are you sure you want to change the base?
Changes from 1 commit
f64cd44
098ccbd
efc0ba8
2c5ae54
1c9ecda
9a636ed
5f7d0e0
d347572
16f6a97
04d2553
931d679
3b87491
30c7ea7
cbb491e
560f395
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
- Loading branch information
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -143,39 +143,74 @@ without having to decode these bytes into character values. | |
# Drawbacks | ||
[drawbacks]: #drawbacks | ||
|
||
Why should we *not* do this? | ||
Changing the behavior of an existing concept in a backwards-incompatible way always carries a high cost, | ||
and I do not propose it lightly. | ||
|
||
***TODO*** | ||
|
||
# Rationale and alternatives | ||
[rationale-and-alternatives]: #rationale-and-alternatives | ||
|
||
***TODO*** | ||
|
||
## Enforce valid UTF-16 strings in the verifier | ||
|
||
## Encode valid UTF-16 strings as a library predicates | ||
***TODO*** | ||
|
||
## Define "valid UTF-16 strings" as a predicate in a shared library | ||
|
||
***TODO*** | ||
|
||
## Distinct string type | ||
## Add a new, distinct string type | ||
|
||
## Distinct "rune" type | ||
Since the current definition of `string` as an alias for `seq<char>` | ||
***TODO*** | ||
|
||
## Add a new, distinct "rune" type | ||
|
||
We could maintain the current definition of the `char` type, and introduce a new `rune` type to represent Unicode scalar values | ||
instead ("rune" being the term Go uses for this). | ||
instead ("rune" being the term both Go and C# use for this). | ||
This would make it more obvious when reading Dafny source code in isolation whether it was using the new definitions of strings. | ||
There are often few explicit references to individual character values in code that references string values, however, | ||
and even when the `char` type is used it can often be only implicitly referenced because of type inference. | ||
This alternative does not seem to be worth the implementation cost or additional cognitive load on users, | ||
especially since it is hard to imagine any codebase needing to use both `char` and `rune` together. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm not sure about this (never using both char and rune): IIRC Windows file system APIs use sequences of Dafny-3 Maybe this just means that we'll have to build file system APIs with There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah, if you need to manipulate possibly-invalid sequences of UTF-16 code points, that should probably be a I'm personally happy with the trade-off of a simpler, maximally correct and portable definition of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added an explicit note on this, and an open question on being able to compile to a native utf-16 type. |
||
|
||
## Disallow compiling s[i] | ||
|
||
Given popular Unicode encodings such as UTF-8 and UTF-16 do not support random access of characters, | ||
we could decide that | ||
we could decide to forbid random access of string elements in compiled code, | ||
instead directing users to fallback to external code for efficient access, | ||
or to sequential access via the new [ordered quantification features](https://github.com/dafny-lang/rfcs/pull/10) (once they are implemented). | ||
This would be a much more disruptive breaking change for existing code, however. | ||
String indexing would not be the first instance in Dafny of a simple, mathematical expression | ||
that supports verification easily but is inefficient at runtime. | ||
The proposed lazy decoding approach should provide a good balance between the clean expression of concepts | ||
and efficient, unsurprising runtime behavior. | ||
|
||
# Prior art | ||
[prior-art]: #prior-art | ||
|
||
Unicode has a long and messy history, and the approach to supporting Unicode varies dramatically across different programming languages. | ||
Here is the current state of many of the most popular programming languages, especially those that Dafny currently or will likely | ||
soon support compiling to: | ||
|
||
***TODO*** | ||
|
||
* C#: | ||
* Java: | ||
* Go: | ||
* JavaScript: | ||
* C/C++: | ||
* Rust: | ||
* Swift: | ||
|
||
# Unresolved questions | ||
[unresolved-questions]: #unresolved-questions | ||
|
||
***TODO*** | ||
|
||
# Future possibilities | ||
[future-possibilities]: #future-possibilities | ||
|
||
***TODO*** |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with
char
is that in unicode "character" refers to something that's pretty different from a scalar value (in particular, a "character" in unicode can be made up of more than one "scalar value").There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was trying to say I would personally find it confusing to read code that uses both
char
to represent UTF-16 code units andrune
to represent Unicode scalar values. Code like https://github.com/dafny-lang/libraries/tree/master/src/Unicode specifically doesn't connect the built-inchar
type to anything because it generalizes over multiple encoding forms and UTF-16 isn't special in that library.You're also absolutely right that even with
/unicodeChar:1
,char
still doesn't correspond to an abstract Unicode character. Perhaps you're arguing that we should deprecatechar
and replace it withrune
to more aggressively avoid that confusion? I can add that as another option here: just deprecatingchar
for now wouldn't be as disruptive as some of these options and could be worth it.