Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Tree Borrows Provenance GC compact the tree #3837

Merged
merged 3 commits into from
Aug 28, 2024

Conversation

JoJoDeveloping
Copy link
Contributor

@JoJoDeveloping JoJoDeveloping commented Aug 23, 2024

Follow-up on #3833 and #3835. In these PRs, the TB GC was fixed to no longer cause a stack overflow. One test that motivated it was the test fill::horizontal_line in tiny-skia. But not causing stack overflows was not a large improvents, since it did not fix the fundamental issue: The tree was too large. The test now ran, but it required gigabytes of memory and hours of time (only for it to be OOM-killed 🤬), whereas it finishes within 24 seconds in Stacked Borrows. With this merged, it finishes in about 40 seconds under TB.

The problem in that test was that it used slice::chunked to iterate a slice in chunks. That iterator is written to reborrow at each call to next, which creates a linear tree with a bunch of intermediary nodes, which also fragments the RangeMap for that allocation.

The solution is to now compact the tree, so that these interior nodes are removed. Care is taken to not remove nodes that are protected, or that otherwise restrict their children.

I am currently only 99% sure that this is sound, and I do also think that this could compact even more. So @Vanille-N please also have a look at whether I got the compacting logic right.

For a more visual comparison, here is a gist of what the tree looks like at one point during that test, with and without compacting.

This new GC requires a different iteration order during accesses (since the current one can make the error messages non-deterministic), so it is rebased on top of #3843 and requires that PR to be merged first.

Comment on lines 277 to 306
pub fn can_be_replaced_by_child(self, child: Self) -> bool {
match self.inner.partial_cmp(&child.inner) {
Some(Ordering::Less) | Some(Ordering::Equal) => true,
Some(Ordering::Greater) | None => false,
}
}
Copy link
Contributor Author

@JoJoDeveloping JoJoDeveloping Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the magic function. I think that this is sound, i.e. if it returns true, then the child can indeed replace the parent.

Here is an example: The parent is Active, and the child is Frozen. On a foreign write, both go to Disabled. On a foreign read, both go to Frozen. On a child read, both remain; and the Frozen tag blocks child writes. So the Active has no effect and can be compacted by replacing it with its child.

Comment on lines 277 to 306
pub fn can_be_replaced_by_child(self, child: Self) -> bool {
match self.inner.partial_cmp(&child.inner) {
Some(Ordering::Less) | Some(Ordering::Equal) => true,
Some(Ordering::Greater) | None => false,
}
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also don't think this function is complete. For once, if the parent is ReservedIM, and the child is Reserved, we can still replace the parent by the child, whereas partial_cmp says that they are incomparable.

Copy link
Contributor Author

@JoJoDeveloping JoJoDeveloping Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the handling of Reserved {conflicted: true} parents is very subtle:

  • Usually, it's not OK to replace a conflicted Reserved by an Active. But a Reserved never has Active children anyways (as all Active have Active parents), and further, since we know the parent is not protected, we know that the conflictedness does not matter anyway
  • This exposes another way the method is incomplete: a conflicted Reserved parent can be replaced by a non-conflicted Reserved child, since the parent's conflictedness no longer matters, as the parent is not protected. But this is not how partial_cmp works.

(Note that I am highlighting all these cases to show that the reasoning here is indeed subtle in many cases)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial_cmp on permissions denotes reachability, and is not by definition tied to the amount of UB that the permission can produce. I like the idea of can_be_replaced_by_child but I think the above implementation suggests that the replaceability derives from partial_cmp when it's actually very dependent on invariants of the tree. I think it should be decoupled from partial_cmp and hand-written to explicitly show all compactable cases with their justification.

Copy link
Contributor

@Vanille-N Vanille-N Aug 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example the fact that you can compact an Active parent with a Frozen child is entirely dependent on whether the Active is protected, so there are at minimum many unwritten assumptions in can_be_replaced_by_child about the fact that protected tags can't be GC'd and the valid parent-child combinations.
If we make the match explicit we can add a sanity assert(!parent_prot) in the parent: Active, child: Frozen case.

Alright, the diff had cut off the comment above so I hadn't seen that the parent not being protected is at least documented. I still think we should decouple replaceability from reachability.

Copy link
Contributor Author

@JoJoDeveloping JoJoDeveloping Aug 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, we do not remove protected parents. Such parents should never be GC'd anyways since they still have a protector end write coming.

Note that I do think there's a deeper connection between reachability and replaceability: Permissions that come "later" always have "less" permissions than those that come earlier. It should always be sound to replace a node by one that's further ahead in the state machine (protectors notwithstanding). While this is not "by definition," it is the underlying principle making TB work. But I agree that it should probably be decoupled in code.

Comment on lines 733 to 736
if global.borrow().protected_tags.contains_key(&child.tag) {
// todo think really hard whether we can't just handle this as we would regularly
return None;
}
Copy link
Contributor Author

@JoJoDeveloping JoJoDeveloping Aug 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this check is necessary. I reckon that it is not, i.e. we can merge protected tags up. Part of the intuition is:

  • Almost all the protector does is make more transitions UB. Since we do not remove the protector, these same transitions remain UB
  • The one exception is that Reserved becomes conflicted on foreign accesses -- but this should not cause any issues, since the foreign read still happens similarly on the child.
  • We still know the parent is not protected

@JoJoDeveloping
Copy link
Contributor Author

JoJoDeveloping commented Aug 23, 2024

We should probably have a definition of replaceable(parent, child) and then exhaustively test that:

Let p_parent and p_child be two permissions such that replaceable(p_parent, p_child), and where p_parent is not protected. Then all accesses to them either

  • move them further along such that afterwards replaceable(p_parent', p_child') (they remain in simulation)
  • cause UB. In that case, the UB must also be caused if only p_child is accessed in that manner.

@Vanille-N can you add such a test? Note that replaceable is this method. This has been added now.

@JoJoDeveloping
Copy link
Contributor Author

JoJoDeveloping commented Aug 23, 2024

The tests are failing for several reasons, all related to the fact that it's a "GC-stress" mode test where the GC runs every tick.

First, linux-futex.rs fails because (presumably) the GC now does a bunch more work, so things get delayed, and the time taken for an operation is outside of the acceptible range.

The other tests because the GC messes with the tree structure, combined with a suboptimal tree visiting order when checking accesses. To check an access, one currently starts at the root, and first checks that all nodes from there downwards support a child access. If the violating node (i.e. one not supporting the access) is such a parent, the error message will point this out (and print information about both tags). But quite often, this parent is simply the immediate parent of the node that was accessed, and not used otherwise, so the GC has merged it with the child. Since then it's not a violating parent but the node itself, the error message changes. And this makes the UI tests fail.

My solution would be to change the iteration order to go upwards but not downwards, which would be resilient to GC tree compactings. The comments currently in code seem to imply that the current iteration order has a deeper meaning, but sadly without going any deeper. @Vanille-N do you know more?

@Vanille-N
Copy link
Contributor

My solution would be to change the iteration order to go upwards but not downwards, which would be resilient to GC tree compactings. The comments currently in code seem to imply that the current iteration order has a deeper meaning, but sadly without going any deeper. @Vanille-N do you know more?

Sure. The current iteration order is of course a heuristic (TB doesn't specifically require an order to these) in an attempt to quickly identify the "root cause" of the issue, with a priority for more straightforward errors.

Insufficient permissions are more straightforward than protector violations, so we check parents before children.
The hidden assumption here is that parents can only trigger UB of the kind "insufficient permissions", and children can only trigger UB of the kind "protector violation".

As for the iteration order (top-down), the reasoning is that if we have x parent of y parent of z all Disabled and we do a child access through z, we are going to get the same kind of UB whichever we choose to report, but if we report z and the user fixes just z then we still get UB on y on the next run. At the time of writing this code I guessed that z was more likely to be the root cause of the bug and get the user to fix the UB in fewer steps. I don't have concrete evidence this is mostly guesswork.

I recognize that the combination of this iteration order and this tree compacting strategy can lead to nondeterministic error messages. Fixing that should take absolute priority over whatever diagnostics heuristics is implemented, and any iteration order will probably still give good enough messages.

@Vanille-N
Copy link
Contributor

Vanille-N commented Aug 24, 2024

Finally, from what I understand (I haven't looked at the detail of your compaction logic yet), you seem to be keeping the child and deleting the parent. This is not the only compacting logic possible.

The one I had in mind originally was to do the opposite : look at the parent to determine if we can delete the child. This removes the constraint of "if the parent is a singleton" because it preserves the ancestry relation even in branching trees.
For example it seems to me that your approach is not capable of compacting

0: Frozen
   |-- 1: Frozen (GC)
   |      |-- 2: Frozen
   |      '-- 3: Frozen
   '-- 4: Frozen (GC)
          |-- 5: Frozen
          '-- 6: Frozen

into

0: Frozen
   |-- 2: Frozen
   |-- 3: Frozen
   |-- 5: Frozen
   '-- 6: Frozen

I don't doubt that your bottom-up compacting can merge some configurations that a top-down approach cannot, and maybe those are more frequent, but the point is that the iteration order that I chose is compatible with the compacting logic that I initially had in mind.

@Vanille-N
Copy link
Contributor

Also since wide trees cannot get very deep, it's probably fine to have a compacting logic that can't handle branching if we observe that issues come mostly from concrete examples where the tree is deep.

@JoJoDeveloping
Copy link
Contributor Author

The one I had in mind originally was to do the opposite : look at the parent to determine if we can delete the child. This removes the constraint of "if the parent is a singleton" because it preserves the ancestry relation even in branching trees.

Compared with that approach, doing it like I do here has several advantages:

  • It is in-place: Since we only ever move a child up, we don't need to extend the array of children
  • It is faster: Since we only check single-child nodes, there will never be a case where we check a long array only to realize we can't compact at the last node.
  • It preserves the branching structure: In other words, not flattening nodes with multiple children can be seen as a feature.

Also note that the number of additional nodes that your approach would remove, but mine would not, is bounded: My approach produces trees at most twice as large as yours. The reason is that in a binary tree, the number of interior nodes is at most the number of leaves. So the efficiency gains offered by an even more collapsing GC are constant.

@JoJoDeveloping
Copy link
Contributor Author

JoJoDeveloping commented Aug 24, 2024

Also since wide trees cannot get very deep, it's probably fine to have a compacting logic that can't handle branching if we observe that issues come mostly from concrete examples where the tree is deep.

Note that further optimizations are possible elsewhere to handle wide trees. For instance, those implemented in this branch, which makes reads and writes skip more subtrees. But I did not see them cause a huge speedup, and they also affect diagnostics in rare cases, so I did not PR them yet.

@RalfJung
Copy link
Member

We should probably have a definition of replaceable(parent, child) and then exhaustively test that:

Let p_parent and p_child be two permissions such that replaceable(p_parent, p_child), and where p_parent is not protected. Then all accesses to them either

* move them further along such that afterwards `replaceable(p_parent', p_child')` (they remain in simulation)

* cause UB. In that case, the UB must also be caused if only `p_child` is accessed in that manner.

@Vanille-N can you add such a test? Note that replaceable is this method.

Is this the right check? "Remain in simulation" is very weak. For instance, juts making replaceable always return true seems to satisfy this check, but is obviously not a valid implementation for replaceable.

@JoJoDeveloping
Copy link
Contributor Author

JoJoDeveloping commented Aug 24, 2024

Is this the right check? "Remain in simulation" is very weak. For instance, juts making replaceable always return true seems to satisfy this check, but is obviously not a valid implementation for replaceable.

Such an implementation would fail the second condition (e.g. if the parent is Frozen and to be replaced by a child being Active, child writes will be UB only at the parent). And I think that the relation we are looking for here is precisely "the largest simulation where UB at the child implies UB at the parent."

Note that whether or not something causes UB is the only observable action here. So formally, the parent must simulate the child, including the child's steps to UB.

JoJoDeveloping added a commit to JoJoDeveloping/miri that referenced this pull request Aug 25, 2024
In preparation for rust-lang#3837, the tree traversal needs to be made bottom-up,
because the current top-down tree traversal, coupled with that PR's
changes to the garbage collector, can introduce non-deterministic error
messages if the GC removes a parent tag of the accessed tag that would
have triggered the error first.

This is a breaking change for the diagnostics emitted by TB. The
implemented semantics stay the same.
JoJoDeveloping added a commit to JoJoDeveloping/miri that referenced this pull request Aug 25, 2024
In preparation for rust-lang#3837, the tree traversal needs to be made bottom-up,
because the current top-down tree traversal, coupled with that PR's
changes to the garbage collector, can introduce non-deterministic error
messages if the GC removes a parent tag of the accessed tag that would
have triggered the error first.

This is a breaking change for the diagnostics emitted by TB. The
implemented semantics stay the same.
@RalfJung
Copy link
Member

Such an implementation would fail the second condition (e.g. if the parent is Frozen and to be replaced by a child being Active, child writes will be UB only at the parent).

Your comment says they have to satisfy one of the conditions, and the first one is trivially satisfied. So it's okay to violate the second condition, the entire soundness criterion you are suggesting is still satisfied as far as I can tell.

@JoJoDeveloping
Copy link
Contributor Author

Your comment says they have to satisfy one of the conditions, and the first one is trivially satisfied. So it's okay to violate the second condition, the entire soundness criterion you are suggesting is still satisfied as far as I can tell.

I mean you either move along or cause UB, you can't have both at the same time, so that's why it's "either."

@RalfJung
Copy link
Member

Ah, "move along" implies "without UB". :)

@bors
Copy link
Contributor

bors commented Aug 27, 2024

☔ The latest upstream changes (presumably #3847) made this pull request unmergeable. Please resolve the merge conflicts.

JoJoDeveloping added a commit to JoJoDeveloping/miri that referenced this pull request Aug 27, 2024
In preparation for rust-lang#3837, the tree traversal needs to be made bottom-up,
because the current top-down tree traversal, coupled with that PR's
changes to the garbage collector, can introduce non-deterministic error
messages if the GC removes a parent tag of the accessed tag that would
have triggered the error first.

This is a breaking change for the diagnostics emitted by TB. The
implemented semantics stay the same.
bors added a commit that referenced this pull request Aug 27, 2024
Make TB tree traversal bottom-up

In preparation for #3837, the tree traversal needs to be made bottom-up, because the current top-down tree traversal, coupled with that PR's changes to the garbage collector, can introduce non-deterministic error messages if the GC removes a parent tag of the accessed tag that would have triggered the error first.

This is a breaking change for the diagnostics emitted by TB. The implemented semantics stay the same.
Follow-up on rust-lang#3833 and rust-lang#3835. In these PRs, the TB GC was fixed to no
longer cause a stack overflow. One test that motivated it was the test
`fill::horizontal_line` in `tiny_skia`. But not causing stack overflows
was not a large improvents, since it did not fix the fundamental issue:
The tree was too large. The test now ran, but it required gigabytes of
memory and hours of time, whereas it finishes within seconds in Stacked
Borrows.

The problem in that test was that it used [`slice::chunked`](https://doc.rust-lang.org/std/primitive.slice.html#method.chunks) to iterate
a slice in chunks. That iterator is written to reborrow at each call to
`next`, which creates a linear tree with a bunch of intermediary nodes,
which also fragments the `RangeMap` for that allocation.

The solution is to now compact the tree, so that these interior nodes
are removed. Care is taken to not remove nodes that are protected, or
that otherwise restrict their children.
@JoJoDeveloping
Copy link
Contributor Author

JoJoDeveloping commented Aug 27, 2024

I squashed all the comments here. The reason is that I made some changes to the overall garbage collector infrastructure in the first commit here, that I undid in one of the later commits (because turns out it's unnecessary), but when rebasing to the master branch where the GC also changed, git got confused, and I could not be bothered to correctly replay my changes as I knew that a later commit would undo them.

@JoJoDeveloping JoJoDeveloping marked this pull request as ready for review August 27, 2024 18:59
Copy link
Member

@RalfJung RalfJung left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still have to check the tests

src/borrow_tracker/tree_borrows/perms.rs Show resolved Hide resolved
// Active can be replaced by Frozen, since it is not protected
(Active, Frozen) => true,
(Active, Disabled) => true,
// Frozen can only be replaced by Disabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Frozen can only be replaced by Disabled
// Frozen can only be replaced by Disabled (and itself).

(Active, ReservedIM) => false,
(Active, ReservedFrz { .. }) => false,
(Active, Active) => true,
// Active can be replaced by Frozen, since it is not protected
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Active can be replaced by Frozen, since it is not protected
// Active can be replaced by Frozen, since it is not protected.

(Frozen, Frozen) => true,
(Frozen, Disabled) => true,
(Frozen, _) => false,
// Disabled can not be replaced by anything else
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// Disabled can not be replaced by anything else
// Disabled can not be replaced by anything else.

@@ -128,6 +128,22 @@ impl LocationState {
Ok(transition)
}

/// Like `perform_access`, but ignores the diagnostics, and also is pure.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// Like `perform_access`, but ignores the diagnostics, and also is pure.
/// Like `perform_access`, but ignores the concrete error cause and also uses state-passing
/// rather than a mutable reference.

src/borrow_tracker/tree_borrows/tree.rs Show resolved Hide resolved
src/borrow_tracker/tree_borrows/tree.rs Show resolved Hide resolved
/// should have no children, but this is not checked, so that nodes
/// whose children were rotated somewhere else can be deleted without
/// having to first modify them to clear that array.
/// otherwise (i.e. the GC should have marked it as removable).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The last line doesn't sound like a sentence...?

src/borrow_tracker/tree_borrows/tree.rs Outdated Show resolved Hide resolved
src/borrow_tracker/tree_borrows/tree.rs Outdated Show resolved Hide resolved
@RalfJung
Copy link
Member

Looks good, thanks. :)

@bors r+

@bors
Copy link
Contributor

bors commented Aug 28, 2024

📌 Commit ff0bc0f has been approved by RalfJung

It is now in the queue for this repository.

@bors
Copy link
Contributor

bors commented Aug 28, 2024

⌛ Testing commit ff0bc0f with merge d1aa077...

@bors
Copy link
Contributor

bors commented Aug 28, 2024

☀️ Test successful - checks-actions
Approved by: RalfJung
Pushing d1aa077 to master...

@bors bors merged commit d1aa077 into rust-lang:master Aug 28, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants