
Introduce measure-complexity command #3061

Merged
merged 20 commits into dafny-lang:master on Dec 16, 2022

Conversation


@keyboardDrummer keyboardDrummer commented Nov 16, 2022

Changes

  • Add the measure-complexity command and related options; see below for its help output. This command is described as experimental, since its name, options, and output are expected to change in the future. We expect to merge https://github.com/dafny-lang/dafny-reportgenerator into this repository and include its features in this command.

Testing

  • Updated existing tests to use the new command and options.

Data

Description:
  The Dafny CLI enables working with Dafny, a verification-aware programming language. Use 'dafny /help' to see help 
  for a previous CLI format.

Usage:
  Dafny [command] [options]

Options:
  --version       Show version information
  -?, -h, --help  Show help and usage information

Commands:
  resolve <file>                      Only check for parse and type resolution errors.
  verify <file>                       Verify the program.
  run <file> <program-arguments>      Run the program. []
  build <file>                        Produce an executable binary or a library.
  translate <file>                    Generate source and build files in a specified target language.
  measure-complexity <file>           (Experimental) Measure the complexity of verifying the program.
  server                              Start the Dafny language server
  test <file>                         (Experimental) Execute every method in the program that's annotated with the 
                                      {:test} attribute.
  generate-tests <Block|Path> <file>  (Experimental) Generate Dafny tests that ensure block or path coverage of a 
                                      particular Dafny program.
  find-dead-code <file>               (Experimental) Use counterexample generation to warn about potential dead code.

Usage:
  Dafny measure-complexity [<file>...] [options]

Arguments:
  <file>  input files

Options:
  --iterations <n>                             Attempt to verify each proof n times with n random seeds. If random-seed 
                                               is used, each proof attempt will use a new random seed derived from this 
                                               one. If not, it will use a different seed between iterations. [default: 
                                               10]
  --random-seed <seed>                         Turn on randomization of the input that Boogie passes to the SMT solver 
                                               and turn on randomization in the SMT solver itself. Certain Boogie 
                                               inputs are unstable in the sense that changes to the input that preserve 
                                               its meaning may cause the output to change. This option simulates 
                                               meaning-preserving changes to the input without requiring the user to 
                                               actually make those changes. The input changes are renaming variables 
                                               and reordering declarations in the input, and setting solver options 
                                               that have similar effects. [default: 0]
  --format <configuration>                     Logs verification results using the given test result format. The 
                                               currently supported formats are `trx`, `csv`, and `text`. These are: the 
                                               XML-based format commonly used for test results for .NET languages, a 
                                               custom CSV schema, and a textual format meant for human consumption. You 
                                               can provide configuration using the same string format as when using the 
                                               --logger option for dotnet test, such as: 
                                               --format:trx;LogFileName=<...>.z
  
                                               The `trx` and `csv` formats automatically choose an output file name by 
                                               default, and print the name of this file to the console. The `text` 
                                               format prints its output to the console by default, but can send output 
                                               to a file given the `LogFileName` option.
  
                                               The `text` format also includes a more detailed breakdown of what 
                                               assertions appear in each assertion batch. When combined with 
                                               the isolate-assertions option, it will provide approximate time and 
                                               resource use costs for each assertion, allowing identification of 
                                               especially expensive assertions. []
  --isolate-assertions                         Verify each assertion in isolation. [default: False]
... common options...
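
For illustration, a hypothetical invocation using only the options documented above (the file name is invented for the example):

  dafny measure-complexity --iterations 20 --random-seed 1 --format csv --isolate-assertions MyProgram.dfy

This would verify each assertion of MyProgram.dfy in isolation, twenty times with seeds derived from 1, and log the results to an automatically chosen CSV file whose name is printed to the console.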

By submitting this pull request, I confirm that my contribution is made under the terms of the MIT license.

@keyboardDrummer keyboardDrummer enabled auto-merge (squash) November 16, 2022 11:46
@keyboardDrummer keyboardDrummer changed the title from "Introduce Stability command" to "Introduce check-stability command" Nov 16, 2022

@robin-aws robin-aws left a comment


Top level comment: we've never collectively been happy with the term "stability", and I've got a mostly-written proposal for what I feel is much better terminology. I'd prefer to block before we lock down that term in the actual toolchain. Happy to prioritize resolving the proposal ASAP so that's not a big delay.

@keyboardDrummer
Member Author

Top level comment: we've never collectively been happy with the term "stability", and I've got a mostly-written proposal for what I feel is much better terminology. I'd prefer to block before we lock down that term in the actual toolchain. Happy to prioritize resolving the proposal ASAP so that's not a big delay.

That's fine. Consider this PR a way to make that upcoming decision more concrete :-D


@atomb atomb left a comment


I'm very excited to have this feature available in the new CLI! Most of my comments are just wording changes, but I'm also wondering whether this should be a set of additional flags to verify instead of a separate command.

public static readonly FormatOption Instance = new();
public override object DefaultValue => Enumerable.Empty<string>();

public override string LongName => "format";
Member


Maybe name this option --log-format instead of just --format?

Member Author


What makes you think of the word log? This command is expected to return data on stability, and this option changes how that data is shown.

@@ -45,6 +45,7 @@ static class CommandRegistry {
AddCommand(new RunCommand());
AddCommand(new BuildCommand());
AddCommand(new TranslateCommand());
AddCommand(new StabilityCommand());
Member


I think I'd prefer this to be a set of options to the verify command rather than a separate command, though I'm open to arguments to the contrary. At the very least, though, it will be valuable to be able to analyze stability (or whatever we call it) in the context of any other set of verification options, so this command (if it's a separate command) should accept all (or almost all) of the same options as the verify command.

Member Author

@keyboardDrummer keyboardDrummer Nov 22, 2022


this command should accept all (or almost all) of the same options as the verify command.

Agreed

I think I'd prefer this to be a set of options to the verify command rather than a separate command, though I'm open to arguments to the contrary

Why do you prefer that?

Having them be separate commands means that the options that only relate to stability, of which I think there can be quite a few, can be shown at the top of the help for the stability command, which makes it easier to use.

The stability command also has different output than the verify command since it also includes a statistics report, which I would print to stdout before any diagnostics, and when printing diagnostics it might be good to include the random seed required to get that diagnostic. Having it be a separate command prepares the user for a difference in output.

}.Concat(CommandRegistry.CommonOptions);

public Command Create() {
var result = new Command("check-stability", "(experimental) Check the stability of the program proofs. Be aware that the name, options and output of this command will change in the future.");
Member


I wonder whether it would make sense in the new CLI framework to allow certain commands or options to be programmatically marked as experimental. This would make it possible to disallow experimental features in settings where change is particularly disruptive.
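
A minimal sketch of what a programmatic marker could look like (nothing here exists in this PR; the interface, flag, and gate are invented for illustration):

using System.CommandLine;

interface IExperimentalAwareCommand {
  Command Create();
  bool IsExperimental { get; }   // hypothetical marker
}

static class ExperimentalGate {
  // Register a command only when experimental features are explicitly allowed,
  // so change-sensitive environments never see it.
  public static void Register(RootCommand root, IExperimentalAwareCommand spec, bool allowExperimental) {
    if (spec.IsExperimental && !allowExperimental) {
      return; // hide the experimental command entirely
    }
    root.AddCommand(spec.Create());
  }
}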

Source/DafnyDriver/Commands/StabilityCommand.cs (outdated review thread, resolved)
public override string ArgumentName => "seed";

public override string Description =>
$"Turn on randomization of the input that Boogie passes to the SMT solver and turn on randomization in the SMT solver itself. Certain Boogie inputs are unstable in the sense that changes to the input that preserve its meaning may cause the output to change. This option simulates meaning-preserving changes to the input without requiring the user to actually make those changes. The input changes are renaming variables and reordering declarations in the input, and setting solver options that have similar effects.";
Member


Should this mention Dafny instead of Boogie? I think replacing each instance of "Boogie" with "Dafny" would be similarly useful and wouldn't require knowledge of Boogie.

public override string ArgumentName => "n";

public override string Description =>
$"Attempt to verify each proof n times with n random seeds. If {RandomSeedOption.Instance.LongName} is used, each proof attempt will use a new random seed derived from this one. If not, it will use a different seed between iterations.";
Member


The text "a different seed between iterations" seems a little under-specified to me. We could mention that the base random seed is 0 when not otherwise specified.

@keyboardDrummer keyboardDrummer changed the title from "Introduce check-stability command" to "Introduce check-proof-complexity command" Dec 1, 2022
@robin-aws
Member

Besides the terminology comment, I have one other observation about moving dafny-reportgenerator into this repo and potentially integrating the functionality into the verify command.

We should ensure we maintain support somehow for distributing the verification work to multiple independent invocations of dafny, possibly on different hosts. That's one of the motivations behind building the support for serializing this data through the TRX and CSV formats rather than just producing human-readable results with a single call to dafny verify.

The dafny-reportgenerator functionality should move into a separate command like dafny generate-report that can process the results from multiple verification runs. We should make the common workflow of verifying a code base with multiple random seeds and summarizing the results convenient through a single command, but still support the more general case as well. We should NOT assume it's good enough to just push this functionality directly into verify or check-proof-complexity.

@keyboardDrummer
Member Author

keyboardDrummer commented Dec 5, 2022

We should ensure we maintain support somehow for distributing the verification work to multiple independent invocations of dafny, possibly on different hosts.

Why? I think we can let a single dafny process drive all the verification, which through the configuration of the solver can let the actual verification work be done remotely. Seems simpler and just as performant to me.

@robin-aws
Member

We should ensure we maintain support somehow for distributing the verification work to multiple independent invocations of dafny, possibly on different hosts.

Why? I think we can let a single dafny process drive all the verification, which through the configuration of the solver can let the actual verification work be done remotely. Seems simpler and just as performant to me.

That's one option but it's definitely more limiting, because now dafny and its dependencies always have to take responsibility for distributing the computation. The CI for this repo, for example, uses multiple GitHub actions jobs across multiple runners to split up the integration test suite. It's definitely tractable to have each job log verification results and then have a meta-job summarize them (and we probably should do something like that to track complexity changes over time). If we require that there has to be a single top-level dafny invocation that distributes the work instead, we'd have to drop a huge amount of GHA functionality and take on our own fleet management.

@keyboardDrummer
Member Author

keyboardDrummer commented Dec 5, 2022

We should ensure we maintain support somehow for distributing the verification work to multiple independent invocations of dafny, possibly on different hosts.

Why? I think we can let a single dafny process drive all the verification, which through the configuration of the solver can let the actual verification work be done remotely. Seems simpler and just as performant to me.

That's one option but it's definitely more limiting, because now dafny and its dependencies always have to take responsibility for distributing the computation. The CI for this repo, for example, uses multiple GitHub actions jobs across multiple runners to split up the integration test suite. It's definitely tractable to have each job log verification results and then have a meta-job summarize them (and we probably should do something like that to track complexity changes over time). If we require that there has to be a single top-level dafny invocation that distributes the work instead, we'd have to drop a huge amount of GHA functionality and take on our own fleet management.

I see, thanks!

How about a shard X/Y option that lets dafny check-proof-complexity and dafny verify only verify a subset of the proofs? If all the dafny check-proof-complexity calls pass, then the CI run passes. If one fails, then you get a report for that shard, which I think would be fine even if it's partial. Even if multiple shards fail and you get multiple separate reports, I think that would be a fine UX.
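
For illustration only, a rough sketch of how such a shard option could deterministically assign work, so that independent invocations with --shard <index>/<total> together cover everything exactly once (all names are hypothetical; nothing in this PR implements this):

using System;

static class ShardFilter {
  // Decide whether a verification task (e.g. a module, method, or assertion batch,
  // identified by a stable name) belongs to the given 1-based shard.
  public static bool BelongsToShard(string taskName, int shardIndex, int totalShards) {
    if (totalShards <= 0 || shardIndex < 1 || shardIndex > totalShards) {
      throw new ArgumentOutOfRangeException(nameof(shardIndex));
    }
    // FNV-1a: a stable hash, unlike string.GetHashCode, which can differ per process.
    uint hash = 2166136261;
    foreach (char c in taskName) {
      hash = (hash ^ c) * 16777619;
    }
    return (int)(hash % (uint)totalShards) == shardIndex - 1;
  }
}

Each dafny process would then verify only the tasks for which BelongsToShard returns true for its own shard index.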

@robin-aws
Member

How about a shard X/Y option that lets dafny check-proof-complexity and dafny verify only verify a subset of the proofs? If all the dafny check-proof-complexity calls pass, then the CI run passes. If one fails, then you get a report for that shard, which I think would be fine even if it's partial. Even if multiple shards fail and you get multiple separate reports, I think that would be a fine UX.

A shard X/Y option seems like a great idea. The Dafny CI is already relying on that functionality from lit (and the "xlit" equivalent I had to implement), but it can only shard based on files. Being able to shard the set of assertion batches would enable better balancing for lots of Dafny projects.

Getting a single unified report across shards would still be better IMO, so you can easily see the most expensive batches in one place rather than reading across multiple reports. At the same time I do like the fact that we have the --verification-logger:text option that is immediately readable without needing a separate command. Perhaps we should just have a report or summarize or merge command that takes multiple log files as input and produces a single log file or report.

@keyboardDrummer
Member Author

Perhaps we should just have a report or summarize or merge command that takes multiple log files as input and produces a single log file or report.

Sounds good to me

@keyboardDrummer keyboardDrummer changed the title from "Introduce check-proof-complexity command" to "Introduce check-proof-durability command" Dec 12, 2022
DafnyOptions.RegisterLegacyBinding(Format, (o, v) => o.VerificationLoggerConfigs = v);
}
private static readonly Option<uint> Iterations = new("iterations", () => 10U,
$"Attempt to verify each proof n times with n random seeds, each seed derived from the previous one. {RandomSeed.Name} can be used to specify the first seed, which will otherwise be 0.") {
Member


The NullReferenceException in CI is because you're referencing RandomSeed before it's defined below. :) Swapping the ordering of the two options fixed it for me.
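
For anyone hitting this later, here is a minimal standalone illustration of the pitfall (a toy class, not the actual Dafny code): C# runs static field initializers in textual order, so an initializer that dereferences a later-declared static field sees null.

class InitializationOrderDemo {
  // Runs first: RandomSeedName is still null here, so dereferencing it
  // (RandomSeedName.Length) throws a NullReferenceException, surfaced as a
  // TypeInitializationException when the type is first used.
  static readonly string IterationsDescription =
    $"each seed derived from the {RandomSeedName.Length}-character option";
  // Runs second. Declaring this field above IterationsDescription fixes the
  // problem, which is the reordering suggested in this comment.
  static readonly string RandomSeedName = "random-seed";
}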


@robin-aws robin-aws left a comment


Still not yet convinced we want this as a separate command. Since it's not actually looking at the logged verification data, I don't feel like it's actually "check"ing anything. That makes it feel closer to just "verifying with extra parameters".

DafnyOptions.RegisterLegacyBinding(IsolateAssertions, (o, v) => o.VcsSplitOnEveryAssert = v);
DafnyOptions.RegisterLegacyBinding(Format, (o, v) => o.VerificationLoggerConfigs = v);
}
private static readonly Option<uint> Iterations = new("iterations", () => 10U,
Member


When I try dafny check-proof-durability --help I see these options as iterations instead of --iterations, so I think you need to include the dashes here.
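
For context, a small sketch of the difference (illustrative only, using the same Option constructor overload the PR already uses):

using System.CommandLine;

static class OptionNamingExample {
  // Displayed in help as "iterations", without the expected "--" prefix:
  static readonly Option<uint> WithoutDashes =
    new("iterations", () => 10U, "Number of verification passes.");
  // Displayed in help as "--iterations", matching the other commands:
  static readonly Option<uint> WithDashes =
    new("--iterations", () => 10U, "Number of verification passes.");
}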

Member Author


Made sure all the options are used in tests.

};

public Command Create() {
var result = new Command("check-proof-durability", "(experimental) Check the durability of the program proofs. Be aware that the name, options and output of this command will change in the future.");
Member


This is fine for now, but should definitely explain what "durability" means here somehow before we take it out of "experimental" mode. I'd suggest we address #3174 and then link to the reference manual.


private static readonly Option<bool> IsolateAssertions = new("isolate-assertions", @"Verify each assertion in isolation.");
private static readonly Option<List<string>> Format = new("format", $@"
Logs verification results using the given test result format. The currently supported formats are `trx`, `csv`, and `text`. These are: the XML-based format commonly used for test results for .NET languages, a custom CSV schema, and a textual format meant for human consumption. You can provide configuration using the same string format as when using the --logger option for dotnet test, such as: --format ""trx;LogFileName=<...>.z""
Member


Suggested change
Logs verification results using the given test result format. The currently supported formats are `trx`, `csv`, and `text`. These are: the XML-based format commonly used for test results for .NET languages, a custom CSV schema, and a textual format meant for human consumption. You can provide configuration using the same string format as when using the --logger option for dotnet test, such as: --format ""trx;LogFileName=<...>.z""
Logs verification results using the given test result format. The currently supported formats are `trx`, `csv`, and `text`. These are: the XML-based format commonly used for test results for .NET languages, a custom CSV schema, and a textual format meant for human consumption. You can provide configuration using the same string format as when using the --logger option for dotnet test, such as: --format ""trx;LogFileName=<...>""

ArgumentHelpName = "n"
};

private static readonly Option<uint> RandomSeed = new("random-seed", () => 0U,
Member


Even if we do want the separate command, shouldn't these be in the shared VerificationOptions bucket instead?

Member Author


You mean RandomSeed and IsolateAssertions? What's the use case?

Do you think a Dafny project could be developed with --isolate-assertions on at all times? If not, how would you use it?

If the durability/complexity checker complains that a proof is too complex (saying a proof is not durable doesn't sound right), then you might think a user would want to do dafny verify --random-seed=<problematic-seed> to improve the proof. However, in that case I'd suggest they use dafny check-proof-durability --random-seed=<problematic-seed> instead, because then they'll be able to see whether they introduce other problematic seeds.

Member


Do you think a Dafny project could be developed with --isolate-assertions on at all times? If not, how would you use it?

It certainly could be, even though it's unlikely to scale well enough for typical projects. But I could even imagine degenerate cases of a small project with just a small number of really large methods where that would work well.

(saying a proof is not durable doesn't sound right)

Agreed and see my top-level comment.

then you could think a user might want to do dafny verify --random-seed= to improve the proof

Yeah, I don't think we'd want to encourage that except as a short-term hack, since it's likely not addressing the root cause of proofs that are too complex. But it would be useful in those cases, especially if you really just want to know if your code is valid and aren't yet concerned with durability.

@keyboardDrummer
Member Author

keyboardDrummer commented Dec 13, 2022

Still not yet convinced we want this as a separate command. Since it's not actually looking at the logged verification data, I don't feel like it's actually "check"ing anything. That makes it feel closer to just "verifying with extra parameters".

Do you want me to add the checking to it? I was planning on keeping that out of scope for this PR.

@robin-aws
Member

One final thought: I've actually wondered if verifying with multiple random seeds should be the default verification behavior, since it will be obviously slower but encourage more durable verification. We could offer the single pass as a "fast mode" you switch on for development, but with the understanding that it's like running code without enough tests: something you do as part of development but then catch up on before you cut a PR.

Or perhaps we just ensure that when you use multiple iterations by default you still get the results of the first pass early, especially when verification succeeds. Perhaps the ideal UX for verification should be a gradual ramp up of throwing more and more resources at the problem. You could get a quick "Success!" message or a "Hang on, still trying..." message, and in the background failing assertion batches could be quickly retried with different random seeds, or automatically split into smaller batches, or even farmed out to more powerful hosts.

All of this suggests to me that perhaps we really do just want to think of this as extra configuration on verify rather than as a separate activity.

@atomb
Member

atomb commented Dec 14, 2022

* Perhaps "measure" or something similar is a bit better than "check", since that more accurately describes an exercise that just produces data rather than a pass/fail decision so far, but could imply blocking the build later when we do add the `dafny-reportgenerator` functionality. I'd like to take inspiration from the analogy with measuring testing coverage, and review what other ecosystems call it. We named the `dafny-reportgenerator` the way we did partially because that's the terminology .NET uses.

* I think I would leave out "proof" entirely since that's not actually a term that's applied at the Dafny abstraction level very much. The actual things we're measuring the complexity of are assertion batches after all. It seems reasonable given that `verify` also doesn't specify the thing that's being verified - it's implicitly everything in the whole program.

* I DO like "complexity"! And I wouldn't try to cram "durability" into the command, since we're measuring and trying to bound complexity under the assumption that it correlates with low durability. It's just like how higher testing coverage should correlate with higher code quality/fewer bugs: the actual verb would be to assert the testing coverage is above a certain threshold, not that the actual code quality is high enough since that's not directly measurable.

I think I agree with you on all of the terminology above. (Other thoughts in another comment below!)

@atomb
Member

atomb commented Dec 14, 2022

All of this suggests to me that perhaps we really do just want to think of this as extra configuration on verify rather than as a separate activity.

Yeah, I agree here. I was leaning toward this making more sense as an option to verify before, and I think I'm leaning even more heavily in that direction now.

@keyboardDrummer
Member Author

keyboardDrummer commented Dec 14, 2022

measure-complexity

Works for me.

One final thought: I've actually wondered if verifying with multiple random seeds should be the default verification behavior, since it will be obviously slower but encourage more durable verification. We could offer the single pass as a "fast mode" you switch on for development

I don't think we want this. There should be a command to verify Dafny programs during development that tries to provide immediate feedback. Any unnecessary slowdown for it is undesired. Right now, dafny verify is that command. If you make dafny verify slower by default, then we need another command, verify-fast.

Also, I don't think every Dafny developer will want to measure their proof complexity. It's an advanced concept that is only relevant when you're interested in developing software over time, not when you're playing around with Dafny. Any Dafny developer that is not in a development phase where they want to measure complexity, or that doesn't want to measure complexity at all, shouldn't be confronted with options related to it.

Also, I don't like it if there are options that don't make sense unless you also use other options. For example, with dafny super-verify, we'd have:
--iteration-mode=single|dynamic|maximum (dynamic will stop when the current relative standard-deviation (RSD) is so far from the threshold (above or below) that it's unlikely iterating further will change the result)
--rsd-threshold (threshold for the relative standard deviation that will determine if the command fails or not)
--maximum-iterations=

The last two options don't make sense unless you use a non-default value for the first one. Putting all of these options in a single command with --iteration-mode=dynamic|maximum ensures that the other two options are always relevant.

I would design dafny measure-complexity so it works well for the general case without having to specify any options.
What would be the option that you pass dafny super-verify to tell it to start iterating more than once? You might have dafny verify --iteration-mode=<something>, which is then what customers will put in their CI configuration. Why not specialise for that common scenario so they can write just dafny measure-complexity?

Another thing that concerns me is that for dafny simple-iteration-verify it's clear what the output should be, but it's not clear what the output should be when you do multiple iterations. Do you want to see duplicate diagnostic failures if they occur in several iterations? Do you only want to see diagnostics when they occur in each failing iteration? Do you even want to see any diagnostics or only the statistical table?

Or perhaps we just ensure that when you use multiple iterations by default you still get the results of the first pass early, especially when verification succeeds. Perhaps the ideal UX for verification should be a gradual ramp up of throwing more and more resources at the problem. You could get a quick "Success!" message or a "Hang on, still trying..." message, and in the background failing assertion batches could be quickly retried with different random seeds, or automatically split into smaller batches, or even farmed out to more powerful hosts.

You need to decide when to return a success value, and once you do that you can't go back and say wait no it wasn't a success, because by that time the user is already doing something else.

@robin-aws
Member

Okay I think I'm convinced, and the primary reason is that as you point out, verify and measure-complexity have different primary goals even though their implementation will always overlap substantially:

  • verify is testing whether the program is correct
  • measure-complexity is attempting to ensure verification is durable, technically independently of whether it is correct

I think there are good applications of iteration for verify as well, but the interpretation of multiple results will be different: verify should succeed if a second try with a different random seed happens to work, but measure-complexity should fail (at least by default) if not all iterations succeed. I also agree we shouldn't just expose the same --iterations option verbatim in both commands: in verify it would be more like a --dynamic-retries option, and likely smarter than just rerunning everything with different random seeds.

Also, I don't think every Dafny developer will want to measure their proof complexity. It's an advanced concept that is only relevant when you're interested in developing software over time, not when you're playing around with Dafny. Any Dafny developer that is not in a development phase where they want to measure complexity, or that doesn't want to measure complexity at all, shouldn't be confronted with options related to it.

That's fair as long as we make it very clear that every actual Dafny project should use it, so I wouldn't label it as "advanced" only. If and when we have a dafny new-project command, for example, it should absolutely hook up measure-complexity in CI if possible.

You need to decide when to return a success value, and once you do that you can't go back and say wait no it wasn't a success, because by that time the user is already doing something else.

If we're talking about verify rather than measure-complexity here, the initial success is enough and wouldn't change. If you happened to re-verify the whole program with a different random seed because some batches failed/timed out, getting a failure for that batch doesn't mean the success was wrong.

@keyboardDrummer keyboardDrummer changed the title from "Introduce check-proof-complexity command" to "Introduce measure-complexity command" Dec 14, 2022
@atomb
Member

atomb commented Dec 14, 2022

I think maybe I'm convinced by the idea that verify and measure-complexity have different success conditions, and therefore different exit codes, and should be treated separately because of that.

One thing worth pointing out is that it's sometimes worthwhile to have complexity measurement succeed even if some proof attempts fail, as a way of using the mechanism to gate on things becoming worse. If you allow proofs to occasionally fail but still have a limit on variability of resource use then you can prevent drastic increases in complexity without requiring that all issues be resolved immediately.

Currently, Dafny fails if any of the iterations fail, but it probably makes sense to change that behavior so that it succeeds if any of them succeed (in "verify" mode).

So far, we've handled the question of whether the verification is too complex by having the verification itself always succeed (in "complexity measurement" mode, sometimes by hacking it up with dafny ... || true) and delegating the decision of whether CI passes to dafny-reportgenerator. Given that variability gating happens there, it seems to make sense for all decisions about whether the analysis passes to live there, but I'm open to having that not be true.

@robin-aws
Member

@atomb good point about the exit code, I'd say it's reasonable to have measure-complexity fail by default if any iteration fails, since that's how dafny-reportgenerator behaves unless you pass --allow-different-outcomes. We could add an option like this too if needed. If and when we add some kind of iteration support to verify we can ensure different behaviour there, but that's safely out of scope for this PR.

We can then later have measure-complexity also fail if any batch is above configured thresholds. Since we've marked it as experimental I think that's reasonable (and the bigger behaviour change would be setting default thresholds).

@atomb
Member

atomb commented Dec 14, 2022

@atomb good point about the exit code, I'd say it's reasonable to have measure-complexity fail by default if any iteration fails, since that's how dafny-reportgenerator behaves unless you pass --allow-different-outcomes. We could add an option like this too if needed. If and when we add some kind of iteration support to verify we can ensure different behaviour there, but that's safely out of scope for this PR.

We can then later have measure-complexity also fail if any batch is above configured thresholds. Since we've marked it as experimental I think that's reasonable (and the bigger behaviour change would be setting default thresholds).

So that would essentially mean moving more of the dafny-reportgenerator functionality into Dafny proper, which seems reasonable to me as long as we're doing it intentionally.

@keyboardDrummer
Member Author

Some of the discussion has been about future work. Are there any remaining changes required from this PR right now?


@atomb atomb left a comment


It might be nice to add an --allow-different-outcomes flag, as we discussed, but that could wait until a future PR.


@robin-aws robin-aws left a comment


Some of the discussion has been about future work. Are there any remaining changes required from this PR right now?

Nope, I think this is a good first step. I'm not yet sure exactly what the UX will look like for the use case of just doing the reportgenerator aggregation and testing based on existing log files, but I'm comfortable figuring that out in the future.

Thanks for enduring the naming bikeshedding! :)

ArgumentHelpName = "n"
};

private static readonly Option<bool> IsolateAssertions = new("--isolate-assertions", @"Verify each assertion in isolation.");
Member


I'd like a bit more detail here personally ("Verify each assertion in its own assertion batch. See (refman section) for details.") but not enough to block on.

@MikaelMayer
Member

Arriving a bit late to the debate; thanks for explaining all the reasoning, everyone, that was helpful.
I'm a bit surprised that the consensus went to "measure" instead of "check", especially because all the previous discussions were highlighting how "measure-complexity" and "verify" should have two different exit codes.
Note that "verify" with flags can have more outputs and traces than just the exit code.
Similarly, for the command that we are introducing, if we agree that the exit code can vary, then calling it "check-complexity" would be perfectly fine, because even when we check something, we can still give a detailed report.

On the other hand, with "measure-complexity", "measure" gives the impression that the outcome would be a number between 0 and 100 indicating whether the proofs are straightforward (0) or far too complex (100). If we are not returning a score but a boolean, then I believe "check" is more appropriate than "measure". We can always output more information with specific flags about how to check and what to display.

Now, there is another mental-model problem that the discussion touched on without discussing it deeply.
In the common-sense view, the complexity of a proof is how much time or how many resources it takes to verify the proof. The more time or resources, the more complex the proof. Straightforward proofs take little time and few resources.
So naturally, I believe the proof's complexity can depend on the seed. For some seeds, the proof is straightforward; for other seeds, the solver tries cases forever and never stops.
If you don't believe me, ask someone what it means for something to be complex.

What this PR does is compare the complexity across runs. If this complexity has a low standard deviation (meaning very similar time/resources regardless of the seed, and at least the same verification status), then that is what this PR will report.

What adjectives would you use to describe a process whose outcome has a low standard deviation when executed under slightly different environments? I asked this question to ChatGPT, which is a very good language model. Here are its answers:

  • Consistent
  • Stable
  • Uniform
  • Predictable.

We can't use consistency because it has another meaning when something is axiomatized. Uniformity seems strange to me. Predictability assumes that we can guess the outcome in advance.

So now I understand better why "stability" was much more appropriate for this measure.
I now ask everyone to reconsider check-stability for the reasons mentioned above. We could also use check-solver-stability or check-smt-stability to indicate that the verification status is not at stake, only whether the solver is predictable. After all, we are only modifying the solver's environment to prove the same things.

@robin-aws
Member

@MikaelMayer No worries about being late, we intentionally labelled the command experimental and even warned the name might change for this reason. :) Here are my thoughts.

I still strongly reject "stability" because the lack of it implies something that may be unreliable, non-deterministic, fragile, etc. None of these things apply to something that is successfully verified but may not be durable under future changes. Labelling programs or verification as unstable plants the seed that they should not be trusted. I will take this argument up with ChatGPT if necessary. :D

I believe the proof's complexity can depend on the seed.

This is completely fair, and we are intentionally under-specifying what "complexity" means here. My interpretation is that we're trying to measure the underlying complexity of the problem, because our hypothesis is that it correlates with low durability. If it involves a search through a huge space, indeed one seed might succeed quickly and another in a very long time. But we want to say that if we get this kind of wide variance, it's a clue that the underlying complexity is high.

As for "check" vs. "measure", you do have a point. My concern was that as is, this command only outputs data and not any kind of boolean decision. We would probably like to move the dafny-reportgenerator functionality into this command as well in the future. If and when we do that, I'm much more willing to consider renaming it to "check", but I don't want to 100% assume we will. It's also possible that we will add a separate command called something like bound-complexity that rejects assertion batches whose complexity measure(s) are too high. This command could semantically depend on first executing measure-complexity, although it might also analyze the data logged from a separate run, or even multiple runs across different processes or hosts as we discussed above.

@MikaelMayer
Member

I still strongly reject "stability" because the lack of it implies something that may be unreliable, non-deterministic, fragile, etc.

The question is: do I reject "complexity" as much as you reject "stability"? :-) I explained my reasons why "complexity" is just not appropriate in this context. Synonyms for it would have been "measure-resources" or "measure-time"... is this what this command does? If its main purpose is just to provide the results without aggregating them (like a summary), then "complexity" is fine.
I haven't had the time to look at this command. But if it does something more (and cleverer) than just displaying the raw results for random seeds, then it should be named for what the output is, not for the intermediate computations.
What is it doing here?

@robin-aws
Member

The question is: do I reject "complexity" as much as you reject "stability"? :-) I explained my reasons why "complexity" is just not appropriate in this context. Synonyms for it would have been "measure-resources" or "measure-time"... is this what this command does? If its main purpose is just to provide the results without aggregating them (like a summary), then "complexity" is fine.
I haven't had the time to look at this command. But if it does something more (and cleverer) than just displaying the raw results for random seeds, then it should be named for what the output is, not for the intermediate computations.
What is it doing here?

It's only measuring the dimensions you name and producing the data. It could easily measure other metrics that approximate "complexity" in the future.

@keyboardDrummer keyboardDrummer merged commit b7125bf into dafny-lang:master Dec 16, 2022
@keyboardDrummer keyboardDrummer deleted the stabilityCommand branch December 16, 2022 12:09
@keyboardDrummer
Member Author

keyboardDrummer commented Dec 16, 2022

I now ask everyone to reconsider check-stability for the reasons mentioned above.

I don't like stability because I think the Dafny code itself is not unstable. I think the SMT solver is unstable. We want this command to tell the user something about their Dafny code, not about the SMT solver. So I think the question you should ask ChatGPT is: what do you call an input that causes a process to become unstable?

For fun:
"An input that causes a process to become unstable is often referred to as a "perturbation" or a "disturbance." In control systems engineering, a perturbation is any change in a system's input or environment that can cause the system to behave differently. For example, in a control system that regulates temperature, a sudden change in the ambient temperature could be considered a perturbation. If the system is not able to handle the perturbation and maintain stability, it may become unstable and behave in an unpredictable or undesired manner."

I'm a bit surprised that the consensus went to "measure" instead of "check"

I prefer check as well for the reason that you mention: that this command will return a boolean (once we update it).

@robin-aws
Member

I think the SMT solver is unstable.

We need to avoid this statement too. It makes it sound like the solver is in alpha mode and shouldn't be used in production. And if we use "unstable" to refer to a fundamental limitation of solvers, we imply they should never be used in production.

@keyboardDrummer
Member Author

keyboardDrummer commented Dec 16, 2022

I think the SMT solver is unstable.

We need to avoid this statement too. It makes it sound like the solver is in alpha mode and shouldn't be used in production.

Indeed, there's a risk of someone interpreting this as the solver switching between sat and unsat. Maybe (un)decisiveness is a clearer description of the solver's problem. measure/check-solver-decisiveness could be an option, but I prefer measure/check-[proof-]complexity.

@MikaelMayer
Member

I think the SMT solver is unstable.

We need to avoid this statement too. It makes it sound like the solver is in alpha mode and shouldn't be used in production.

But you have a point: I think that solvers won't be relied upon if ever called from online code in production, i.e. code that is supposed to run every millisecond. Solvers are great for solving offline challenges, such as verifying code, but their unpredictable performance makes them poor candidates for solving online challenges when other specialized solutions exist (and they almost always exist for online code).

If you worry about the marketing effect, I would say that not saying anything about the solver's core instability to perturbations in large inputs is even worse. The only verification engines that don't suffer from this instability are proof assistants, because proofs can be verified in a polynomial and predictable amount of time - but at the expense of providing all the details and tactics.

If proof assistants were rails from which trains can't deviate, Dafny is more like a self-driving car. It can take you much more safely to any place than human drivers ever could, and more easily than trains, but if you ask for a minimum speed (i.e. your code is too complex and the proofs are not straightforward), then Dafny will sometimes move correctly because it can find a path that satisfies your desires, but sometimes it's too complex to find this path and the Dafny car will just not move (it times out when trying to verify your code). Safety first, even if the only difference was a butterfly on a stop sign. If I had a car with such instability, I would still enjoy riding it because it provides guarantees around this instability (i.e. safety).

Or take a GPS app, for example. Users know that the path is unstable (if you don't go in the provided direction, you might end up on a completely different path), but all paths lead to the goal. So the instability is not the problem.

Solvers are unstable with respect to their input when it is large, but their output can always be trusted because it is consistent. At worst, a solver will just say "don't know", and that's ok. Solver outputs have a partial order (don't know < unsat, don't know < sat) and the solver tries to maximize the output, as a constraint solver would try to maximize the number of constraints solved. And it's easy to get into NP-hard problems.

I'm not suggesting check-solver-reliability, but merely check-solver-timings or something that conveys that we want to reduce the time solvers take in general to solve a problem, to avoid timeouts.
