
TokenWrapper to be a concrete all-in-one-class instead of the token hierarchy #3354

Open
MikaelMayer opened this issue Jan 11, 2023 · 4 comments
Labels: kind: language development speed · status: designed

Comments

@MikaelMayer
Member

Context

Although tokens are normally output by the scanner and provided to the parser, Dafny developers found a loophole and replaced regular tokens with rich tokens, which can attach useful metadata to expressions. This metadata can help identify whether an expression was included from another file, was automatically generated, or was refined from another expression. It also makes sense to send such rich tokens to Boogie, because Boogie attaches them to the errors it finds. This gives a quick way to attach more information to display meaningful errors, besides ProofObligationDescription. Here are the rich tokens Dafny currently defines and uses:

  • A RefinementToken stores the module this token is refining.
  • An IncludeToken states that the token was included from another file and contains the name of the file where the include directive was written.
  • A NestedToken states that this token relates to another token (typically to indicate related positions) and can even store a message about what the inner token contributes.
  • A RangeToken indicates a range between a start token and an end token.
  • An AutoGeneratedToken indicates that the token did not exist in the original source code; it is often used in the resolver.
  • QuantifiedVariableDomain and QuantifiedVariableRange tokens, used to produce better error reporting.
  • A ForceCheckToken, which is supposed to hide inheritance.

These rich tokens are used for error reporting, for example to display related ranges and messages (NestedToken), to customize the error message (QuantifiedVariableDomain), or to not report an error at all (IncludeToken).
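
To make the pattern concrete, here is a minimal sketch of how such wrapper tokens are shaped. The class bodies are simplified and hypothetical; the real Dafny classes carry more members and slightly different signatures:

```csharp
// A minimal sketch of the wrapping pattern, with simplified, hypothetical
// shapes (the real Dafny classes carry more members).
public interface IToken {
  string Filename { get; }
  int Line { get; }
  int Column { get; }
}

// A plain scanner token.
public class Token : IToken {
  public string Filename { get; init; }
  public int Line { get; init; }
  public int Column { get; init; }
}

// A rich token delegates position information to the token it wraps
// and may override parts of it.
public abstract class TokenWrapper : IToken {
  public IToken WrappedToken { get; }
  protected TokenWrapper(IToken wrappedToken) { WrappedToken = wrappedToken; }
  public virtual string Filename => WrappedToken.Filename;
  public virtual int Line => WrappedToken.Line;
  public virtual int Column => WrappedToken.Column;
}

// Example: an IncludeToken remembers the file containing the include
// directive, while positions still come from the wrapped token.
public class IncludeToken : TokenWrapper {
  public string IncluderFilename { get; }
  public IncludeToken(IToken wrappedToken, string includerFilename)
    : base(wrappedToken) {
    IncluderFilename = includerFilename;
  }
}
```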

Besides rich tokens, ProofObligationDescriptions can be provided for Boogie assertions so that, whether an assertion fails or succeeds, one can recover both its token and its ProofObligationDescription to extract failure or success messages.
Cloners also have a method “Tok” meant for cloning tokens; it can wrap tokens as well, e.g. to add refinement information.

Problem description

Problems arise when you consider the canonical nesting of tokens. For example:

  • Should a RangeToken from another file be a RangeToken of two IncludeTokens, or an IncludeToken of a RangeToken? (See the sketch after this list.)
  • Should a refined range rather be defined as a range of refined start and end tokens?
  • What about a nested token of range tokens of refined tokens of included tokens?
  • Errors in included files could be displayed as a related location, but they are not, because the error is reported on the “include” token only, i.e. a nested token is missing.
  • To check whether a token inherits from a certain module definition, two levels of extraction are needed to unwrap nested tokens and auto-generated tokens in a while loop (see RefinementToken.IsInherited). This is clearly not robust.
  • What does it mean to “clone” a token, really? It looks like most of the time it is to add metadata, which has little to do with the token itself.
  • Tokens are no longer a record but a class (because of their linked list), so “cloning” them deeply is difficult.
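
To make the first bullet concrete, here is a sketch of the two equivalent nestings, reusing the hypothetical classes from the earlier sketch plus a similarly simplified RangeToken:

```csharp
// A similarly hypothetical RangeToken, plus a demonstration of the ambiguity:
// both values below describe "a range included from another file", yet they
// nest the wrappers in opposite orders, so code that unwraps tokens has to
// anticipate both shapes.
public class RangeToken : TokenWrapper {
  public IToken StartToken { get; }
  public IToken EndToken { get; }
  public RangeToken(IToken startToken, IToken endToken) : base(startToken) {
    StartToken = startToken;
    EndToken = endToken;
  }
}

public static class NestingAmbiguity {
  public static void Demo(IToken start, IToken end) {
    // Option A: a RangeToken of two IncludeTokens.
    var a = new RangeToken(new IncludeToken(start, "main.dfy"),
                           new IncludeToken(end, "main.dfy"));
    // Option B: an IncludeToken of a RangeToken.
    var b = new IncludeToken(new RangeToken(start, end), "main.dfy");
  }
}
```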

There is now a bigger problem because of #3290.
If we followed the pattern above, we would just add a “format token” that contains the original token and the range. But then the canonical ordering of tokens is even harder to define.

Possible solutions

Do nothing about it. This has the problems above, but we could just add cases whenever we see one that is not covered.

Define a canonical ordering of tokens so that their nesting order is checked at run time. Could we even make such an order checked at compile time using a specific hierarchy of tokens?
A big inconvenience is that we would still have to traverse tokens to get valuable information anyway.
If we have a canonical ordering, we might as well have a single type of token, as below.

  • A single record TokenWrapper (my preference)
  • Besides the regular Token class, we remove all other classes extending TokenWrapper and incorporate their semantics into TokenWrapper itself by making it concrete, with the following properties:
    • WrappedToken: This is the original untouched token that the TokenWrapper is “wrapping”.
      WrappedToken cannot be a TokenWrapper itself. It has to be a regular Dafny.Token (not even a Dafny.IToken).
    • InnerToken: This can be either a Token or a TokenWrapper. It’s borrowed from NestedToken.
    • Message: Either null or a string. This is borrowed from NestedToken, to describe why the InnerToken is necessary.
    • IncludedIn: Either null or an Include, which contains the included file name. If non-null, the filename of the TokenWrapper is overridden by IncludedIn (the filename of the included file can still be recovered through WrappedToken.FileName).
    • RefinedFrom: Either a module declaration or null. Replaces the need for RefinementToken. That way, there is no canonical-order problem between IncludedIn and RefinedFrom.
    • StartToken: A regular token only. If this token describes the position of an expression, StartToken describes the start position. Borrowed from RangeToken.
    • EndToken: A regular token only. If this token describes the position of an expression, EndToken describes the end position. Borrowed from RangeToken.
    • AutoGenerated: A boolean indicating whether the attached expression was auto-generated, meaning its wrapped token is not relevant. It would have no StartToken and no EndToken either, but WrappedToken could still point to an original token for error reporting (though not for token ownership).
    • QuantifiedVariableDomain: A boolean
    • QuantifiedVariableRange: A boolean
    • ForceCheckToken: A boolean
    • Any other formatting field, such as the base in which a number was parsed.

Advantages: Extraction is easy and canonical. No need to change things. Cloning would be easy.
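
A minimal sketch of what such a concrete record could look like, reusing Token/IToken from the first sketch and assuming hypothetical stand-ins Include and ModuleDefinition for the corresponding Dafny classes (this record would replace the abstract class sketched earlier):

```csharp
// Hypothetical stand-ins for the corresponding Dafny classes.
public class Include { public string IncludedFilename; }
public class ModuleDefinition { }

// The proposed all-in-one record, replacing the wrapper hierarchy.
public record TokenWrapper(
  Token WrappedToken,                   // original, untouched scanner token
  IToken InnerToken = null,             // related token (from NestedToken)
  string Message = null,                // why InnerToken matters (from NestedToken)
  Include IncludedIn = null,            // non-null if included from another file
  ModuleDefinition RefinedFrom = null,  // non-null if refined from a module
  Token StartToken = null,              // start of the range (from RangeToken)
  Token EndToken = null,                // end of the range (from RangeToken)
  bool AutoGenerated = false,           // absent from the original source
  bool QuantifiedVariableDomain = false,
  bool QuantifiedVariableRange = false,
  bool ForceCheckToken = false);
```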

Why is this important?

This would help solve #3290 in a simple way.
Rather than defining a new Format, we could just define RichToken and have cloners add information to RichTokens by cloning them as records. That way, failing assertions could even report both a token and a range. The range is good for underlining, but the token helps disambiguate where the failing ranges are.
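
For instance, with such a record, a cloner could attach refinement metadata in one step using C#'s with-expression, with no wrapper hierarchy to traverse (a sketch based on the hypothetical record above):

```csharp
// Nondestructive "cloning" via a with-expression: copy the record and
// override only the metadata being added; everything else is preserved.
public static class ClonerSketch {
  public static TokenWrapper AddRefinement(TokenWrapper tok, ModuleDefinition m) {
    return tok with { RefinedFrom = m };
  }
}
```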

@keyboardDrummer introduces a way to access the parsed document on the language server.
#3338
However, it does so by “cloning” the AST being transformed, so it is essential that no formatting information is lost, especially the RangeTokens. OwnedTokens can be computed independently after that.

When #3290 is done, I can proceed with migrating the formatting PR to the new token format.
#2399
This will make it possible to remove the need for two tree traversals (one pre-parser, the other post-parser).

We want to remove the concrete children that I introduced in my PR for the Formatter.
#3339
The only way to do that is to ensure Children computes pre-resolution trees if called before resolution. But given the architecture of the language server, that tree would be a clone, and clones currently don’t clone RangeTokens, which are used for OwnedTokens, which the formatter requires...

@davidcok
Collaborator

I've also added a CodeActionToken, which is a RangeToken specifically recognized as a range relevant for code actions.

But all I really need is to know (a) that this is the CodeAction information and (b) a start and end position. The fact that a RangeToken contains two tokens is a nuisance from my point of view, especially since the actual range is from the start through the end, not up to the end -- or at least I need that as an option, and the easiest way is just to have a start and a length.

@MikaelMayer
Member Author

We could put a token property named CodeAction in TokenWrapper as well. It's part of the flat metadata of the token anyway and, as you state, is different from RangeToken in that it could contain one (or more?) ranges.

Why is it a nuisance that RangeToken contains two tokens? Would you prefer that we record StartToken and EndToken independently, and that RangeToken is just a regular token that does not contain them?

@davidcok
Collaborator

For the purpose of a simple CodeAction, all that is needed is a range (start and length), which may not correspond to a token of the AST. It may well be that more complex code actions will need a reference to a part of the AST (e.g. a renaming or restructuring), but I have not progressed far enough to know what information will be needed for those. I suppose it might be better actually to base CodeActionToken on Token instead of RangeToken. What does RangeToken actually need both tokens for? Eventually one just extracts an LspRange anyway?

@MikaelMayer
Member Author

What does RangeToken actually need both tokens for? Eventually one just extracts an LspRange anyway?

RangeToken is actually just a way to record the start position and a length, and since we need the start token and end token of an expression for other reasons, I found it was the natural way to hold these two together. RangeTokens are used to display errors not just on tokens but on entire expressions, both on the CLI with --show-snippets and in the LSP.
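
For illustration, here is a sketch of such an extraction (a hypothetical helper, reusing the IToken from the earlier sketch; LSP positions are 0-based while scanner tokens are typically 1-based):

```csharp
// Hypothetical extraction of an LSP-style range from a start and an end
// token. The displayed range runs *through* the end token, hence the end
// token's length is added to the end column.
public record LspPosition(int Line, int Character);
public record LspRange(LspPosition Start, LspPosition End);

public static class RangeExtraction {
  public static LspRange ToLspRange(IToken start, IToken end, int endTokenLength) {
    return new LspRange(
      new LspPosition(start.Line - 1, start.Column - 1),
      new LspPosition(end.Line - 1, end.Column - 1 + endTokenLength));
  }
}
```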
