Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect printing of generically-typed strings in Go (a.k.a. should string != seq<char>?) #2582

Open
robin-aws opened this issue Aug 11, 2022 · 15 comments
Assignees
Labels
breaking-change Any change that will cause existing Dafny codebases to break (excluding verification instability) testing-method: uniform-backend-testing Issues found by ensuring uniform testing across backends

Comments

@robin-aws
Copy link
Member

method Main() {
  var s := "!dlroW olleH";
  PrintReversed(s);
}

method PrintReversed<T>(s: seq<T>) {
  var reversed := Reverse(s);
  print reversed, "\n";
}

function method Reverse<T>(s: seq<T>): seq<T> {
  // Not s == [] to avoid hitting https://github.com/dafny-lang/dafny/issues/2581 :)
  if |s| == 0 then [] else Reverse(s[1..]) + [s[0]]
}

This prints Hello, World! in every target language except for Go, which prints [H, e, l, l, o, , W, o, r, l, d, !].

The issue is that string is just an alias for seq<char> in Dafny, but we have a special case of implicit parametric type specialization in Dafny, where print-ing a string should look different than a generic sequence.

Each backend implements this in almost completely unique ways:

  • C# relies on the fact that generic types are reified by checking if (typeof(T) == typeof(char))
  • Java relies on explicitly passed TypeDescriptor values
  • Javascript actually instantiates a native string instead of a dafny.Seq (leading to other issues)
  • Go relies on compiler-inserted hints that a value is known statically to be a seq<char> (which is why it fails on this example)
  • Python dynamically checks if every element of a sequence is a character (specifically a 1-element str value), which is the only reason it passes this example.
  • The C++ runtime uses template specialization to special-case operator<< for a DafnySequence<char>&.

@RustanLeino has suggested that one solution would be to align on Go's approach, such that a sequence value only prints like a string if the static type of the value is seq<char>. This would be a step backwards in some cases, but seems reasonable given that print is much more of a debugging tool than a rigorous I/O feature, and hence best-effort support that is 100% consistent and simple to implement in each runtime is appealing.

Another option is to make string a distinct type from seq<char>, given that we are already planning on making breaking changes related to strings to provide better Unicode support (dafny-lang/rfcs#13). At a minimum we could have string act like a newtype based on seq<char>, so that it would support all the same sequence operations but requiring as casting to be used as a seq<char>.

Or we could go much further and define string as a completely separate built-in type with its own API, possibly NOT including random-access element selection (s[i]) to allow for the fact that native strings encoded with UTF-8 or UTF-16 cannot directly support random access. I wouldn't recommend this, as it would not only be a much bigger migration pain for users but would also mean a lot more work in the verifier to understand strings and on Dafny users to achieve verification (and will address this in more detail in the RFC).

@cpitclaudel
Copy link
Member

Very nice investigation 😮

For the record, our documentation considers this a TODO for go (https://dafny.org/dafny/Compilation/Go.html):

  • There isn’t always enough run-time type information to determine whether a sequence is a string (that is, a seq). In particular, a value created as a seq will always be output as a sequence of individual A values rather than as a string, even if A is char for a particular invocation.

@robin-aws
Copy link
Member Author

I neglected to mention another option: align on the Java solution of always providing a TypeDescriptor when constructing sequence values when the target language doesn't reify enough information itself, so that even in generic code we always know when a value is a seq<char>. This could be advantageous for languages that need to monomorphize sequences for good performance (which is the main reason Java provides them, and which other runtimes such as Go and JavaScript should be as well (#1896). It would also mean additional compiler complexity and a mild runtime overhead though.

@RustanLeino
Copy link
Collaborator

I'm in favor of (keeping string as the synonym of seq<char> as it has always been, and) changing what we do for print to be clear about when we print things as string. The goal would be to

  • have a clear, documented design that supports print in the common case of outputting simple messages and debugging information, and
  • allow a simple implementation without too much run-time type information,
  • be consistent among the various compilers.

I propose:

if the static, declared type of an expression given to `print` is `seq<char>`, then
    print it as a string
else
    print it in a way that resembles what the value would look like in a Dafny program

More specifically, "print as a string" means printing each character as a printable character, without any separator between the characters and without any enclosing brackets or double quotes. This seems like what you want when you're knowingly giving print a string; for example, print 2, " is not ", 3, "\n"; would output

2 is not 3

(with a newline at the end).

In the else branch above, note that a char would (breaking change!) no longer be printed as a printable character, but would instead be printed as the Dafny-program-text representation of a char literal (e.g., `a` or `\n` or `\u20c3`). If you have an expression c of type char and you instead want to print it as a printable character, simply convert it to a string: print [c];.

When printing the field of a datatype, the "static, declared type" refers to the type of the field in the datatype declaration, not the instantiated type. This is also a breaking change, because the current behavior (even in some cases in Go) seems to look at the run-time type of values to print. So, according to what I'm proposing, given

datatype D<X> = D(X, string)

the statement print D("hello", "there") would output

D(['h', 'e', 'l', 'l', 'o'], "there")

@robin-aws robin-aws added the breaking-change Any change that will cause existing Dafny codebases to break (excluding verification instability) label Aug 12, 2022
@robin-aws
Copy link
Member Author

Thanks @RustanLeino! I like your proposal even though it is a breaking change and a step backwards in some runtimes. I think making the runtimes consistent and avoiding the cost/complexity of tracking this RTTI is worth it.

Just to clarify, your D example seems to imply the following rules, expressed in pseudo-reflection-supporting-Dafny:

// Independent semantics for each argument to print statments
method Print<T>(t: T) {
  if T == string {
    print t as string;
  } else {
    print ToString(t);
  }
}

// Similar to __repr__ in Python: meant to produce Dafny expression syntax
// All runtimes have something equivalent to this.
function method ToString<T>(t: T) {
  if T == string then
    '"' + t as string + '"'
  else if T == seq<E> then
    '[' + CommaSeparated(Seq.Map(ToString, t as seq<T>)) + ']'
  else if T == D<X> then
    var D(x, s) := t as D<X>;
    // X here might be a concrete type like string, or could be a type parameter
    // and therefore not provide any help to ToString<T>.
    'D(' + ToString<X>(x) + ', ' + ToString<string>(s) + ')'
  else (etc.)
}

Just pointing out that we seem to need special casing of seq<char> as a statically-known type for both print and the implicit internal recursive function to convert to strings.

Is this correct as well?

datatype D<X> = D(X, string)

method Main() {
  var d: D<string> := D("hello", "there");
  print d;    // prints D("hello", "there"), since X is statically known to be string?
}

If so, it implies that we need to thread a type descriptor through ToString, and one with more detail than we currently track in most runtimes.

@keyboardDrummer
Copy link
Member

keyboardDrummer commented Aug 15, 2022

It seems surprising that a change in a static type, which doesn't cause a type error, influences runtime behavior. Does that occur anywhere else in the language?

Have we considered having one printing option where the static type must be a seq<char>, that "prints as a string", and another that has no type restrictions but always prints "in a way that resembles what the value would look like in a Dafny program" ?

For example:

var x := [1,2,3];
var y := "hi";
print x; // Error, x is not a seq<char>
print y; // Accepted, prints "hi"
print stringify(x); // Accepted, prints "[1, 2, 3]"
print stringify(y); // Accepted, prints "['h', 'i']"

That's more verbose and I guess @RustanLeino you're argueing that print is only for debugging, so it's better for it to be what you want 99% of the time and cheap to write, than what you want a 100% of the time and verbose. I'm OK with that just wanted to discuss an alternative.

@RustanLeino are you also saying you expect us to keep a dedicated print for debugging after we've added more IO features, meant for non-debugging use, to Dafny?

@cpitclaudel
Copy link
Member

When printing the field of a datatype, the "static, declared type" refers to the type of the field in the datatype declaration, not the instantiated type.

But why? Surely printing Some(x).value and x should give the same thing, no?

@keyboardDrummer
Copy link
Member

keyboardDrummer commented Aug 15, 2022

But why? Surely printing Some(x).value and x should give the same thing, no?

I think the suggestion is to have print work like this:

print Some("hi").value; // prints "hi"
print Some("hi"); // prints "Some(['h','i'])"
print OnlyContainsString("hi"); // prints "OnlyContainsString(hi)"

I think because it only requires implementing a single runtime type based print function. You could generate a print function based on the type Option<string> that prints "Some(hi)" when doing print Some("hi");, but that might be too much work for the gain.

@cpitclaudel
Copy link
Member

Ah! Hmm. Maybe that fine then ^^
Thanks for the clarification! :)

@robin-aws robin-aws self-assigned this Oct 24, 2022
@robin-aws
Copy link
Member Author

I've pulled in implementing this change as part of dafny-lang/rfcs#13, so this breaking change is part of the larger breaking change controlled by --unicode-char.

robin-aws added a commit to robin-aws/dafny that referenced this issue Oct 25, 2022
@robin-aws
Copy link
Member Author

robin-aws commented Oct 25, 2022

One consequence of this design I'm only seeing now that I'm implementing it: we are avoiding the need to maintain runtime information to distinguish string from seq<char>, but we are increasing the need to track the distinction between char and other types in the various runtimes:

method Print<T>(t: T) {
  print t, "\n";
}

We have been implying that Print('D') should print 'D' in all runtimes, or at least something that is definitely a character and not something like D or "D" or 67 etc. That's actually going to be relatively complicated in backends that don't have a dedicated type for Unicode scalar values. Java should be tractable because we already pass TypeDescriptors around everywhere, which lets us distinguish using a Java int as a Dafny int subset type vs. a Dafny char. Python and Javascript, on the other hand, have no character types at all, implying we'll have to include one in their Dafny runtimes.

I still think it's a very good idea to make print behave consistently across the runtimes, and given char IS a distinct type in Dafny we should print it as such in all cases. I'm just writing this down for posterity and making myself feel better by ranting. :)

@RustanLeino
Copy link
Collaborator

@robin-aws Good catch. This is a bummer. I think this is where print will have to continue to be just a best-effort. So, I suggest we just print such a char as a number (or whatever the target representation uses). We have similar problems with, for example, unsigned integers being represented using a signed integer in the target language.

@robin-aws
Copy link
Member Author

Agreed it's a bummer, but I'm still leaning towards making the proposed behavior work. It's at least tractable to implement in all the backends, just harder to implement efficiently in some.

Allowing the backend to choose the most convenient behavior means we are regressing in terms of consistency across backends. By my testing Print('D') currently always prints D instead of 'D', so it's at least possible to assert a program outputs the same thing.

@robin-aws
Copy link
Member Author

@RustanLeino I've found a good solution for Java that seems to work well with no runtime overhead, and wanted to write it down here in case someone else wants to use it to solve the signed integer problem later.

We use primitive types like int to represent more specific Dafny types such as a signed integers or (once --unicode-char is implemented) a Unicode scalar value. Because Java requires boxing to use such values in generic code, we lose this distinction when we box up an int as an Integer. The solution is to define our own boxing type, e.g. CodePoint, that wraps an int just as Integer does, but therefore knows the true intended type more precisely and overrides toString() accordingly. If we then also define a TypeDescriptor for UNICODE_CHAR that uses int as the primitive, unboxed type but CodePoint as the boxing type, we can still use the optimized Array class to store a sequence of code points in an int[].

This requires the java compiler code to augment EmitCoercionIfNecessary and invoke it in a few more places, in order to convert between int and CodePoint, but so far that doesn't seem to be a very big change. We only get away with not converting in these places because Java supports autoboxing for the built-in types, but the Dafny compiler can do something very similar for our own custom boxing types.

There's an added benefit of increased type safety in the generated code as well, preventing us from accidentally mixing up an Integer and a CodePoint (or a SignedInteger in the other case).

@RustanLeino
Copy link
Collaborator

A late response to something earlier in this thread. In what I proposed, I had not intended what @robin-aws asked about on Aug 12. In particular, I had intended for the ToString pseudo-code method not to have a case for if T == string.

Instead, I had intended what @keyboardDrummer had said on Aug 15, namely that print would behave like this:

print Some("hi").value; // prints "hi"
print Some("hi"); // prints "Some(['h','i'])"
print OnlyContainsString("hi"); // prints "OnlyContainsString(hi)"

Regarding Java, I do like @robin-aws 's suggestion of using our own boxing type for things like (our) char and unsigned integers.

@robin-aws
Copy link
Member Author

robin-aws commented Nov 24, 2022

I didn't see this response until after merging #3016, but let me respond in case we want to tweak the feature before releasing.

I had intended for the ToString pseudo-code method not to have a case for if T == string

That's good because I didn't end up implementing that per se. :) I did end up something close that only relied on static type information at the point of declaration, similar to the top-level case for print, which means that while the first two lines are correct:

print Some("hi").value; // prints "hi"
print Some("hi"); // prints "Some(['h','i'])"

The third is slightly different:

print OnlyContainsString("hi");

will produce OnlyContainsString("hi") rather than OnlyContainsString(hi). The latter would still require effort to recognize the parameter is a string and hence not print OnlyContainsString(['h','i']). Given that, it seemed better to produce OnlyContainsString("hi") since it more closely "resembles what the value would look like in a Dafny program".

If you still have concerns @RustanLeino let me know. Otherwise I'll close this issue as --unicode-char at least addresses the problem, and we don't have any plans to change behavior in --no-unicode-char mode.

@robin-aws robin-aws added the testing-method: uniform-backend-testing Issues found by ensuring uniform testing across backends label Jun 8, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking-change Any change that will cause existing Dafny codebases to break (excluding verification instability) testing-method: uniform-backend-testing Issues found by ensuring uniform testing across backends
Projects
None yet
Development

No branches or pull requests

4 participants