Revisit definition of Dafny strings (RFC #12) #413

robin-aws · 2019-11-05T16:49:54Z

(was: "Explicitly document that Dafny strings are UTF-16")

I found this just a bit too subtle when reading the reference manual. There are a couple references to "Unicode letters" and "Unicode characters" that should really be "Unicode 16-bit code units". The grammar does spell out that a Unicode character is four hex digits, but that's a little bit buried.

Even better, I would love it if Dafny characters were actually sequences of Unicode code points, so that the built-in sequence operations could be used to perform robust per-character string manipulation.

robin-aws · 2020-02-12T00:10:48Z

Going further than this, even if the canonical representation of strings in Dafny is using UTF16, another issue with just making string an alias for seq<char> means that it allows invalid UTF16 sequences (those that use surrogate code points incorrectly for example).

Given that Unicode is so often misunderstood and often the source of subtle bugs, I think there's a great opportunity to have a verifiably-correct implementation of Unicode in Dafny. Languages like Java and C# will often allow invalid sequences and can be lax with encoding/decoding, but I think it would be much better to be strict in Dafny.

robin-aws · 2020-05-14T18:17:55Z

Making string an alias for seq<char> also has implications for interacting with external code: it is necessary to convert from the native representation in the target language to the runtime implementation of Dafny sequences. This can have performance consequences and also complicates the "foreign function interface": because sequences are value types in Dafny you can't refer to a possibly-null string value, which happens all the time in external code.

robin-aws · 2020-06-19T17:11:53Z

My current plan here is to keep string as an alias for seq<char>, but to redefine char to mean "Unicode scalar value" ala Rust, and to therefore need 32 bits instead of 16. This means the direct value of a string is the "true" Unicode sequence. Within each target runtime, it should be possible to implement this type by wrapping the native representation of strings (as Java already does with the StringDafnySequence class), therefore allowing more efficient encoding of these values. This will help with both performance and the FFI - yes, doing per-character operations will be slow because the native wrappers will not be random access w.r.t. the actual Unicode scalar values, but it should be possible to optimize all the operations Unicode-compliant code will actually need to do.

indo-dev-0 · 2020-11-14T06:24:13Z

Hi! I'm modelling bits of Unicode in Dafny.

I think you will need to offer a fairly rich string API. Users who are modelling abstract systems to ensure correctness of an algorithm probably don't want to spend their compute budget proving anything about strings. There are also plenty of specifications that limit input to printable ASCII, Base64, alphanumerics, etc.

Users who want to prove complex things about strings will also need to worry about low level details like normal forms, stream safety, Unicode versions, and platform-specific or even broken behaviour. At that point, Unicode Scalar Value won't be good enough, as you will need to reason about non-character code points.

I think it would be interesting to see a functional base class that works across all string types, with iterators for low-level per-codepoint/per-scalar access but emphasizing higher level operations like split, contains, replace, pad, etc. A richer API would result in fewer programmer errors, as there are lots of sharp corner cases in the Unicode standard, especially with regard to comparison functions.

Did you find any high-assurance implementations of Unicode? I found some stuff on Liquid Haskell verifying a subset of Haskell's text handling library, but Unicode is not a very specific search query.

robin-aws · 2020-11-16T17:12:41Z

Hi there! I think we have compatible but fairly different goals here. My main intention with this issue was to make the Dafny built-in string type represent the abstract, semantic value of a string instead of the implementation. The current definition is definitely too tightly coupled to the latter since it refers to 16-bit code units, which is how strings are implemented in .NET and Java, for example, but certainly not in all target languages.

As I mentioned in my last comment, the immediate plan is only to defer to existing string implementations in the target languages. This will be the right choice for most programs for efficiency and compatibility with external code.

What you're describing should definitely be possible too, however: a Dafny-native implementation of Unicode based on actual byte sequences, which would use ghost string values to prove correctness. I'd suggest you define your own Unicode scalar value type for now (something like newtype myChar := c: int | ...) so that you can use seq<myChar> for a similar purpose and not be thrown off by the current definition of string.

indolering · 2020-11-18T00:51:03Z

What you're describing should definitely be possible too, however: a Dafny-native implementation of Unicode based on actual byte sequences, which would use ghost string values to prove correctness. I'd suggest you define your own Unicode scalar value type for now (something like newtype myChar := c: int | ...) so that you can use seq for a similar purpose and not be thrown off by the current definition of string.

This is what I have done, although using nats as the base value. I've also created parallel OO versions of each type, primarily as a pedagogical exercise. I'm assuming it will be easier to verify statements about sets/maps of the primitive type than it will be to verify transforms of the OO versions.

seanmcl · 2021-06-09T15:15:16Z

I think implementing UTF-8 and UTF-16 in Dafny, rather than in C#, is a good approach. Like @indolering mentioned, you could get all sorts of nice theorems about the relationship between the underlying byte array and the string. Keeping the Dafny implementation as small as we can is also a benefit. It's interesting to think about how to make Dafny code that uses strings agnostic in its unicode encoding, which would be something we need if we want to compile to languages with different encodings.

indolering · 2021-06-10T02:07:19Z

I'm planning on doing this, I just need to get some design feedback from the Unicode mailing list and apply for funding.

robin-aws · 2022-07-04T18:16:55Z

I've got a concrete design in mind for Dafny 4.0, and will write it up in a short RFC.

robin-aws · 2022-07-04T18:27:38Z

dafny-lang/rfcs#12

Implementation of the design from dafny-lang/rfcs#13. Resolves #413. Fixes #2928. Fixes #818. Fixes #3058. Fixes #1293. Fixes #3001. Depends on #2976 to be fixed for the tests to pass consistently across platforms. I've documented some of the less obvious compilation strategy decisions in `docs/Compilation/StringsAndChars.md`. <small>By submitting this pull request, I confirm that my contribution is made under the terms of the [MIT license](https://github.com/dafny-lang/dafny/blob/master/LICENSE.txt).</small>

robin-aws changed the title ~~Explicitly document that Dafny strings are UTF-16~~ Revisit definition of Dafny strings May 14, 2020

robin-aws mentioned this issue May 14, 2020

Design a sound Dafny/.NET shim layer aws/aws-encryption-sdk-dafny#165

Closed

7 tasks

robin-aws mentioned this issue Jun 19, 2020

print in JS sometimes prints array of char instead of string #258

Closed

robin-aws added the kind: enhancement Enhancements, feature requests, etc. These are NOT bugs, but ways we can improve Dafny label Jun 23, 2020

acioc added this to the Dafny 3.0 milestone Aug 12, 2020

robin-aws mentioned this issue Aug 22, 2020

Dafny does not support unicode characters (outside of escape sequences) #818

Closed

davidcok modified the milestones: Dafny 3.0, Dafny 3.1 Dec 23, 2020

robin-aws mentioned this issue Sep 22, 2021

Dafny strings need a lexicographic comparison operator #1070

Open

robin-aws mentioned this issue Jan 21, 2022

Design Proposal: Support for enumeration #1753

Open

atomb removed this from the Dafny 3.1 milestone Apr 21, 2022

robin-aws mentioned this issue Jun 6, 2022

Dafny string literals with surrogate characters compiled to invalid Go strings #1980

Closed

robin-aws added part: language definition Relating to the Dafny language definition itself breaking-change Any change that will cause existing Dafny codebases to break (excluding verification instability) labels Jun 29, 2022

robin-aws self-assigned this Jun 29, 2022

robin-aws mentioned this issue Jul 4, 2022

RFC pending: Unicode string support dafny-lang/rfcs#12

Closed

robin-aws mentioned this issue Jul 14, 2022

RFC: Unicode strings and characters dafny-lang/rfcs#13

Open

robin-aws added this to the Dafny 4.0 milestone Aug 12, 2022

robin-aws changed the title ~~Revisit definition of Dafny strings~~ Revisit definition of Dafny strings (RFC #12) Oct 4, 2022

robin-aws mentioned this issue Nov 8, 2022

feat: Unicode strings (--unicode-char) #3016

Merged

robin-aws closed this as completed in #3016 Nov 24, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Revisit definition of Dafny strings (RFC #12) #413

Revisit definition of Dafny strings (RFC #12) #413

robin-aws commented Nov 5, 2019 •

edited

Loading

robin-aws commented Feb 12, 2020

robin-aws commented May 14, 2020

robin-aws commented Jun 19, 2020

indo-dev-0 commented Nov 14, 2020

robin-aws commented Nov 16, 2020

indolering commented Nov 18, 2020

seanmcl commented Jun 9, 2021

indolering commented Jun 10, 2021

robin-aws commented Jul 4, 2022

robin-aws commented Jul 4, 2022

Revisit definition of Dafny strings (RFC #12) #413

Revisit definition of Dafny strings (RFC #12) #413

Comments

robin-aws commented Nov 5, 2019 • edited Loading

robin-aws commented Feb 12, 2020

robin-aws commented May 14, 2020

robin-aws commented Jun 19, 2020

indo-dev-0 commented Nov 14, 2020

robin-aws commented Nov 16, 2020

indolering commented Nov 18, 2020

seanmcl commented Jun 9, 2021

indolering commented Jun 10, 2021

robin-aws commented Jul 4, 2022

robin-aws commented Jul 4, 2022

robin-aws commented Nov 5, 2019 •

edited

Loading