Value "refinements" #153
Merged
Conversation
Previously the internal definition of an unknown value was any value whose raw value "v" field was exactly equal to the unexported global "unknown". In preparation for supporting optional refinement of unknown values we'll now broaden that definition to any value where "v" is an instance of *unknownType, regardless of exact value. In later commits we'll add optional fields inside struct unknownType to capture the optional additional refinements. "Refinements" means extra constraints on an unknown value that don't affect its type but do constrain the possible concrete values the unknown value could be a placeholder for.
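As a rough sketch of what that broadened representation might look like — the real internals of cty are unexported, so the names and field layout here are illustrative assumptions only:

```go
// Illustrative sketch only: the real internal representation in cty is
// unexported and its names may differ.
package cty

// unknownValRefinement stands in for whatever interface the later commits
// use to describe the individual refinement kinds.
type unknownValRefinement interface{}

// unknownType is the raw "v" payload for unknown values. Previously a
// single shared instance signalled "unknown"; now any *unknownType does.
type unknownType struct {
	// refinement, if non-nil, carries the optional extra constraints on
	// the range of values this placeholder could stand for.
	refinement unknownValRefinement
}

// isUnknownRaw reports whether a raw payload marks an unknown value under
// the broadened definition.
func isUnknownRaw(raw interface{}) bool {
	_, ok := raw.(*unknownType)
	return ok
}
```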
This is a new concept allowing an unknown value to optionally carry some additional information about constraints on the value for which it is a placeholder. Refinements do not affect the type of the value or most equality-testing rules. As of this commit they also don't affect any other operations on values, but in future commits we'll use refinements to improve the precision of certain operations on unknown values when the refinements give enough information to still produce a known result. For the moment this is limited to a few key cases that seem likely to help produce known results in common operations involving unknown values:
- Equality tests with null, which are commonly used as part of conditional branches in calling applications.
- Numeric bounds, focused in particular on comparison to zero but general enough to help with other numeric comparisons too.
- Bounds on collection length, used in conjunction with the numeric bounds so that asking for the length of an unknown collection can return an unknown number with numeric bounds.
- String prefixes, to help with string-prefix-related tests that seem to be a common subset of validation checks in calling applications.

Applications will tend to need to do some work to refine their own values in order to benefit fully from this, but we'll also plumb refinements into some of the built-in value operations and stdlib functions over time, so that even applications not working directly with refinements can still get some indirect benefit.
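A hedged sketch of how the builder-style refinement API can express those four kinds of constraint. `Refine`, `RefinementBuilder.NewValue`, and `StringPrefixFull` appear elsewhere in this changeset; the other builder method names used here (`NotNull`, `NumberRangeLowerBound`, `CollectionLengthLowerBound`, `CollectionLengthUpperBound`) are assumed for illustration:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

func main() {
	// Null-ness: a placeholder only for non-null strings.
	s := cty.UnknownVal(cty.String).Refine().NotNull().NewValue()

	// Numeric bounds: an unknown number known to be at least zero.
	n := cty.UnknownVal(cty.Number).Refine().
		NumberRangeLowerBound(cty.Zero, true). // true = inclusive
		NewValue()

	// Collection length bounds: an unknown list with 1 to 5 elements.
	l := cty.UnknownVal(cty.List(cty.String)).Refine().
		CollectionLengthLowerBound(1).
		CollectionLengthUpperBound(5).
		NewValue()

	// String prefix: an unknown string known to start with "https://".
	p := cty.UnknownVal(cty.String).Refine().StringPrefixFull("https://").NewValue()

	fmt.Println(s, n, l, p)
}
```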
Using the "refinements" feature we can refine the range of any unknown result from Value.Equals to exclude cty.NullVal(cty.Bool), since we know that an equality test can never return null. We can also return a known False if the test is between a null value and an unknown value that has been refined as non-null.
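A small illustration of that second case, with the expected result hedged in the comment:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

func main() {
	nonNull := cty.UnknownVal(cty.String).Refine().NotNull().NewValue()

	// One operand is null and the other is refined as definitely not null,
	// so per the commit message this equality test should yield a known
	// cty.False rather than an unknown bool.
	fmt.Println(cty.NullVal(cty.String).Equals(nonNull))
}
```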
The length of a collection is never null and always at least zero. For sets containing unknown values in particular we also know that they must have at least one element (otherwise there wouldn't be an unknown value) and that they can't have any more elements than currently stored (because coalescing can't ever generate new elements).
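For example, something along these lines (the ValueRange accessor signatures used to inspect the result are assumptions):

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

func main() {
	// Two elements as written, but the unknown one might coalesce with
	// "a" once known, so the real length is somewhere between 1 and 2.
	s := cty.SetVal([]cty.Value{
		cty.StringVal("a"),
		cty.UnknownVal(cty.String),
	})

	length := s.Length()
	// Per the commit message, the unknown length should now be refined
	// as non-null and bounded to that 1..2 range.
	lower, _ := length.Range().NumberLowerBound() // assumed accessor: (bound, inclusive)
	upper, _ := length.Range().NumberUpperBound()
	fmt.Println(length, lower, upper)
}
```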
We can refine the unknown value results for all of the arithmetic functions to represent that they cannot possibly be null. The Value.Absolute method is additionally guaranteed to return a value greater than or equal to zero, by definition.
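A quick sketch of the Absolute case; the ValueRange accessors used to inspect the result are assumptions:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

func main() {
	abs := cty.UnknownVal(cty.Number).Absolute()

	// Per the commit message, the unknown result should be refined as
	// non-null and greater than or equal to zero.
	fmt.Println(abs.Range().DefinitelyNotNull())
	lower, _ := abs.Range().NumberLowerBound() // assumed accessor: (bound, inclusive)
	fmt.Println(lower)
}
```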
Many of our operation methods are guaranteed to never return null, so we'll refine the unknown values they return to specify that.
This slightly reduces the result range described by unknown results from these functions so that downstream operations on the results have more chance of producing known results even though their input is unknown.
I initially just copied the refinement model for numeric range, but in retrospect that was silly because we know that collection lengths are always integers and so there's no need to model unknown values and fractional values and whatever else. Instead, we now just have an integer lower and upper bound, always known but when unconstrained they are the smallest and largest values of int respectively.
The new concept of "refinements" for unknown values allows us to track some additional constraints on the range of an unknown value. Normally those constraints would get lost during serialization, but for msgpack in particular we can use an extension code to represent an unknown value that has refinements and then include a serialization of the refinements inside its body. This then allows preserving the refinements when communicating over wire protocols that use msgpack as their base serialization. However, as of this commit the unmarshal step isn't quite complete because the refinement API for numeric range is incomplete. We'll address both the refinement API and the msgpack package's use of it in a future commit. This required upgrading to a newer major version of the msgpack library because v4 did not have all of the functionality required to encode a raw extension object's payload.
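A hedged round-trip sketch using the cty/msgpack package; whether a given refinement kind survives the round trip depends on what the extension payload encodes, as noted above:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
	"github.com/zclconf/go-cty/cty/msgpack"
)

func main() {
	// A refined unknown value to round-trip through msgpack.
	v := cty.UnknownVal(cty.String).Refine().
		NotNull().
		StringPrefixFull("id-").
		NewValue()

	b, err := msgpack.Marshal(v, cty.String)
	if err != nil {
		panic(err)
	}
	got, err := msgpack.Unmarshal(b, cty.String)
	if err != nil {
		panic(err)
	}

	// If the extension payload carries these refinement kinds, the
	// decoded value should still report them via its range.
	fmt.Println(got.Range().DefinitelyNotNull(), got.Range().StringPrefix())
}
```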
The API for declaring numeric bounds was previously rather awkward and inconsistent with the API for inspecting them in the ValueRange type. Now we use separate methods for defining the lower bound and upper bound, allowing the inclusiveness of each one to be independent. We retain the previous NumberRangeInclusive helper for setting both bounds at once, but it's now just a thin wrapper around the other two.
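For illustration — NumberRangeInclusive is named above, while the per-bound method names used here are assumptions:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

func main() {
	// Independent bounds with independent inclusiveness: at least 0
	// (inclusive) and less than 10 (exclusive).
	a := cty.UnknownVal(cty.Number).Refine().
		NumberRangeLowerBound(cty.Zero, true).
		NumberRangeUpperBound(cty.NumberIntVal(10), false).
		NewValue()

	// The retained helper sets both bounds at once, inclusively.
	b := cty.UnknownVal(cty.Number).Refine().
		NumberRangeInclusive(cty.Zero, cty.NumberIntVal(10)).
		NewValue()

	fmt.Println(a, b)
}
```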
Previously we considered it a usage error to assert a bound that was outside of the existing refined range. However, if we already know that a number value is less than 2 then it isn't really an error to assert that it's also less than 4; the second assertion just doesn't give us any new information and so we should silently discard it and retain the original assertion. It's still an error to make assertions that directly contradict existing refinements, such as asserting that an unknown value isn't null and then later asserting that it's null.
When we're being asked for the length of an unknown collection we can always put at least _some_ bounds on the range of the result, because we know that a collection can't have a negative length and it cannot have a length that is bigger than the current machine's integer size because otherwise we wouldn't be able to index into it. For some collections we may have further refinements on the length of the unknown collection, in which case we'll also transfer those to being refinements on the range of the numeric result.
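For example (the ValueRange accessors used to inspect the result are assumptions):

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

func main() {
	// An unknown list we already know has between 2 and 5 elements.
	l := cty.UnknownVal(cty.List(cty.String)).Refine().
		CollectionLengthLowerBound(2).
		CollectionLengthUpperBound(5).
		NewValue()

	length := l.Length()
	// The result is still unknown, but per the commit message its numeric
	// range should now carry the 2..5 bounds transferred from the input.
	lower, _ := length.Range().NumberLowerBound() // assumed accessor: (bound, inclusive)
	upper, _ := length.Range().NumberUpperBound()
	fmt.Println(length, lower, upper)
}
```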
This allows us to return a known result in more cases than before by making deductions based on the known numeric range. This first implementation is a little more conservative than it strictly needs to be because it just treats all bounds as exclusive rather than taking into account inclusive bounds. This makes the implementation a bit simpler for this initial round though, and should never generate an incorrect known result.
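A sketch of the kind of deduction this enables:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

func main() {
	// An unknown number known to be at least 1.
	n := cty.UnknownVal(cty.Number).Refine().
		NumberRangeLowerBound(cty.NumberIntVal(1), true).
		NewValue()

	// Every value in that range is greater than zero, so per the commit
	// message this comparison should now be a known cty.True even though
	// the operand itself is unknown.
	fmt.Println(n.GreaterThan(cty.Zero))
}
```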
When given an unknown value, the Strlen function will now return a refined unknown value with an inclusive lower bound. This bound is always at least zero because a string can't have a negative length, but if we also know a prefix for the string then we can refine the result even more to take that into account.
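For example:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
	"github.com/zclconf/go-cty/cty/function/stdlib"
)

func main() {
	// An unknown string known to start with "https://".
	s := cty.UnknownVal(cty.String).Refine().
		StringPrefixFull("https://").
		NewValue()

	length, err := stdlib.Strlen(s)
	if err != nil {
		panic(err)
	}
	// Per the commit message the unknown length should carry a lower
	// bound: at least zero for any string, and at least the prefix
	// length here.
	lower, _ := length.Range().NumberLowerBound() // assumed accessor: (bound, inclusive)
	fmt.Println(length, lower)
}
```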
When comparing a known value to an unknown value we may be able to return a definitive False if we can prove that the known value isn't in the range of the unknown value.
Because cty strings are Unicode text following both the normalization and text segmentation rules, we need to be cautious in how we model prefix matching of strings in the refinement system. To avoid problems caused by combining character sequences, we'll trim off characters from the end of a prefix if we cannot guarantee that they won't combine with something that follows. This then avoids us overpromising what the prefix will be in situations where the last character might change after appending the rest of the string. For applications that fully control their values and can guarantee that no combining will occur at the boundary, the new function StringPrefixFull retains the previous behavior of StringPrefix. The idea here is to be conservative by default but allow applications to be less conservative if they are able to offer greater guarantees.
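A sketch contrasting the two functions; exactly how much of the prefix the conservative variant retains is an implementation detail, so the comments hedge on that:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

func main() {
	// Conservative: the trailing "a" could in principle combine with a
	// following combining character, so the recorded prefix may be
	// trimmed rather than kept in full.
	a := cty.UnknownVal(cty.String).Refine().StringPrefix("cluster-a").NewValue()

	// Full: the caller guarantees no combining can happen at the
	// boundary, so the whole prefix is retained.
	b := cty.UnknownVal(cty.String).Refine().StringPrefixFull("cluster-a").NewValue()

	fmt.Println(a.Range().StringPrefix(), b.Range().StringPrefix())
}
```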
When encoding a JSON string we will often be able to predict the first character of the result based just on the argument type. That wouldn't be worthwhile alone, but we can also use a known first character of a JSON document to infer a specific result type even for an unknown value, and to raise an error immediately if the known prefix cannot possibly be a JSON value of any type. Unfortunately due to how JSON interacts with the cty type system the only case where type information truly carries through in a round-trip is encoding an unknown string and then decoding that unknown result, but JSON may be produced and consumed by components other than this pair of functions and they may be able to reach more definitive conclusions based on this additional information.
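A hedged example of the unknown-string case described above:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
	"github.com/zclconf/go-cty/cty/function/stdlib"
)

func main() {
	// Encoding an unknown string: the result must start with a quote
	// character, so per the commit message the unknown result can carry
	// that first character as a prefix refinement.
	v, err := stdlib.JSONEncode(cty.UnknownVal(cty.String))
	if err != nil {
		panic(err)
	}
	fmt.Println(v.Range().StringPrefix())
}
```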
It's common for format strings to begin with at least a few literal characters before the first verb, and if so we can predict the prefix of the result even if some of the remaining arguments are unknown.
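For example:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
	"github.com/zclconf/go-cty/cty/function/stdlib"
)

func main() {
	// The format string starts with literal text before the first verb,
	// so even with an unknown argument the result's prefix is known.
	v, err := stdlib.Format(cty.StringVal("Hello, %s!"), cty.UnknownVal(cty.String))
	if err != nil {
		panic(err)
	}
	// Per the commit message, the unknown result should be refined with
	// the "Hello, " prefix (possibly trimmed under the conservative
	// prefix rules described above).
	fmt.Println(v.Range().StringPrefix())
}
```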
Previously we were preserving marks only in the simple cases, and not properly dealing with already-marked known values when checking refinements. Now we'll always unmark a value on entry into Refine, and then reapply the same marks on all return paths out of RefinementBuilder.NewValue.
When we're converting from a structural type to a collection type the source type gives us information to refine the range of the resulting collection even if the input value is unknown. Similarly when converting from one collection type to another we can transfer the length range constraints because, aside from set element coalescing, conversions cannot change the length of a collection.
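A sketch of the structural-to-collection case (the ValueRange length accessors are assumptions):

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
	"github.com/zclconf/go-cty/cty/convert"
)

func main() {
	// An unknown object value: its type alone tells us it has exactly two
	// attributes, so a conversion to a map must produce exactly two elements.
	objTy := cty.Object(map[string]cty.Type{
		"a": cty.String,
		"b": cty.String,
	})
	v, err := convert.Convert(cty.UnknownVal(objTy), cty.Map(cty.String))
	if err != nil {
		panic(err)
	}
	// Per the commit message the unknown map result should carry length
	// bounds derived from the source type.
	fmt.Println(v.Range().LengthLowerBound(), v.Range().LengthUpperBound())
}
```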
We know that the maximum possible cardinality for a resulting set is the product of the cardinalities of all of the input sets. We might not know the exact cardinalities of all of the input sets either, but if we do know an upper bound then we can use that to calculate the maximum cardinality of the result instead.
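For example, with one known two-element set and one unknown set bounded at three elements, the product's cardinality can be bounded above by 2 × 3 = 6; a hedged sketch:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
	"github.com/zclconf/go-cty/cty/function/stdlib"
)

func main() {
	known := cty.SetVal([]cty.Value{cty.StringVal("a"), cty.StringVal("b")}) // cardinality 2
	bounded := cty.UnknownVal(cty.Set(cty.String)).Refine().
		CollectionLengthUpperBound(3). // at most 3 elements
		NewValue()

	v, err := stdlib.SetProduct(known, bounded)
	if err != nil {
		panic(err)
	}
	// Per the commit message, the unknown result's length can be bounded
	// above by 2 * 3 = 6.
	fmt.Println(v.Range().LengthUpperBound()) // assumed accessor
}
```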
Previously we were treating these as unknown, which is also a reasonable way to model a lack of bounds but is less convenient when we want to do arithmetic against the bounds.
If we are performing addition, subtraction, or multiplication on unknown numbers with known numeric bounds then we can propagate bounds to the result by performing interval arithmetic. This is not as complete as it could be because of trying to share a single implementation across all of the functions while still dealing with all of their panic edge cases.
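A sketch of the addition case:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
)

func main() {
	// Two unknown numbers with known bounds: a in [1, 5], b in [10, 20].
	a := cty.UnknownVal(cty.Number).Refine().
		NumberRangeInclusive(cty.NumberIntVal(1), cty.NumberIntVal(5)).
		NewValue()
	b := cty.UnknownVal(cty.Number).Refine().
		NumberRangeInclusive(cty.NumberIntVal(10), cty.NumberIntVal(20)).
		NewValue()

	// Interval arithmetic: per the commit message the unknown sum should
	// be refined to the range [11, 25].
	sum := a.Add(b)
	lower, _ := sum.Range().NumberLowerBound() // assumed accessor: (bound, inclusive)
	upper, _ := sum.Range().NumberUpperBound()
	fmt.Println(sum, lower, upper)
}
```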
Now that we're in the business of trying to produce known values from operations on unknown inputs where possible, these are some simple cases that hold regardless of the range/refinements of the unknown operands and can help take a particular unknown result out of consideration when evaluating a broader expression.
If the list argument is an unknown value or contains unknown values then we can't possibly return a fully-known result, but we do at least know that sorting will never change the number of elements and so we can refine our unknown result using the range of the input value. The refinements system automatically collapses an unknown list collection whose upper and lower length bounds are equal into a known list where all elements are unknown, so this automatically preserves the known-ness of the input length in the case where we're given a known list with unknown elements, without needing to handle that as a special case here.
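A hedged example of that collapsing behavior:

```go
package main

import (
	"fmt"

	"github.com/zclconf/go-cty/cty"
	"github.com/zclconf/go-cty/cty/function/stdlib"
)

func main() {
	// A known list whose elements are unknown: sorting can't change its length.
	l := cty.ListVal([]cty.Value{
		cty.UnknownVal(cty.String),
		cty.UnknownVal(cty.String),
	})

	sorted, err := stdlib.Sort(l)
	if err != nil {
		panic(err)
	}
	// Per the commit message, the equal lower and upper length bounds
	// should collapse the result back into a known two-element list of
	// unknown strings.
	fmt.Println(sorted.IsKnown(), sorted.Range().LengthLowerBound(), sorted.Range().LengthUpperBound())
}
```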
So far I've just left this implied because mostly I was the one responsible for upgrading this library in various applications that use it (by virtue of an employment coincidence). However, the (indirect) use of this library has broadened lately, so it's worth being clear about what does not constitute a breaking change in a minor release so that application developers can have some guidance on what to do when they see a test failure after upgrading and so that hopefully they can avoid depending on implementation details in their main code, outside of unit tests.
Originally the purpose of unknown values in cty was primarily to allow type checking of expressions whose input values are not fully known yet. Consequently they had a pretty simple model: an unknown value has a type constraint, and operations on unknown values check that the type constraints are valid but otherwise just return even more unknown values to allow type checking to continue with subsequent derived expressions.

However, this limited approach misses some opportunities to use known constraints on the range of results from certain operations to allow some operations on unknown values to return known values, or at least values that are "more known" (have a smaller range than just meeting a type constraint).

This changeset introduces a new model of "refinements" that can constrain the range of an unknown value beyond just a type constraint, and some early work to make use of that new information to return refined values from various built-in operations.

This is backward-compatible with the old behavior in the sense that conventionally we don't consider it a breaking change for a result to be "more known" in a new minor release than it was in the previous release. However, applications depending on operation methods -- in particular Value.RawEquals, but not limited to that -- may find that their unit tests now produce slightly different results where previously-unknown values are now either fully known or are refined with a more limited range. If the new refined range seems correct then I'd suggest updating those tests as part of upgrading cty; if the refined range seems incorrect then that may be a bug, so please open a new issue to discuss it.

Future changes might improve the refinements analysis further, causing even smaller ranges for unknown values. However, refinement analysis is always "best effort" and so it is not a bug for cty to return an unknown value with a wider range than necessary: narrowing that range might be an opportunity for improvement in a future minor release.