Fix duplicate tree checks within restic check
#2328
Conversation
Codecov Report
@@            Coverage Diff             @@
##           master    #2328      +/-   ##
==========================================
- Coverage   50.95%   46.88%    -4.08%
==========================================
  Files         178      178
  Lines       14548    14544        -4
==========================================
- Hits         7413     6819      -594
- Misses       6069     6715      +646
+ Partials     1066     1010       -56
Continue to review full report at Codecov.
internal/checker/checker.go
Outdated
@@ -498,15 +487,21 @@ func filterTrees(ctx context.Context, backlog restic.IDs, loaderChan chan<- rest
	job treeJob
	nextTreeID restic.ID
	outstandingLoadTreeJobs = 0
	processedTrees = make(map[restic.ID]bool)
The processedTrees map will increase the memory usage of the check command, as it is stored in addition to blobrefs.M. It would be possible to add processedTrees to c.blobrefs and use it exclusively to store the tree ids. The main drawback of this would be that UnusedBlobs has to look into both maps.
Have you collected any stats regarding performance impact?
I've noticed that I forgot to mention the speedup seen for the 40-million-files test set: it went from one week down to approximately one day. And now on to something more comparable. Preface: GitHub seems to mess up the displayed commit ordering; the actual order is as follows (oldest to newest). The benchmark just copies an existing snapshot and adds it with a new set of tags. I've mixed the commit list with the observed changes; the measurement data is shown below.
Thus from my point of view this leaves primarily the following commits without a demonstrable speed-up.
These commits improve data locality or avoid some unnecessary syscalls, respectively. As the accessed files come from the page cache, there's not much that can be shown.
I've split off the repository-related changes into separate PRs. Apart from the commits "checker: Remove duplicate sets used to track existing blobs" and "checker: Unify processed tree and referenced blobs map", which are now merged into a single commit, the PR is largely unchanged; that is, the old performance measurements still apply. The changelog also refers to issue #2284, as this PR fixes the check-command-specific performance problem, while #2523 tackles the index performance.
Interesting work! Based on what you've seen in other PRs, do you think this PR makes changes that conflict with other PRs? Just in general/off the top of your head, I'm not suggesting you sift through them all.
The only conflict I'm aware of is with #2523, which also removes some IDSets from the checker code. The solution is rather easy in that case: just use the code from this PR, as it reduces the memory usage even a bit further.
I just tested this myself on a repo that's 101 GB on disk and contains 4437 snapshots, running restic check both with master and with this PR. That's a huge improvement. I also noticed that restic was using around 395-399% CPU with this PR instead of 380% with master. The system is a small Dell server with 7200 RPM disks on hardware RAID, but clearly that's not a relevant bottleneck in this case. Memory is 8 GB and the CPU has four cores. The generated output from the two runs was identical in everything that matters (a bunch of unreferenced packs and one missing pack in this case).
Another run with a 90+ GB repo took 4 minutes instead of 31. Same machine, 1767 snapshots.
@greatroar I cannot request a review from you using GitHub, but I'd like to ask if you could review this PR nonetheless? |
@@ -504,8 +493,21 @@ func filterTrees(ctx context.Context, backlog restic.IDs, loaderChan chan<- rest

	for {
		if loadCh == nil && len(backlog) > 0 {
			// process last added ids first, that is traverse the tree in depth-first order
This is not depth-first order. It's more like post-order, if I understand the surrounding code correctly.
If the checker only uses a single treeLoader, then the traversal order is a pre-order depth-first search. This line dequeues a child subtree and sends it to the tree loader. Once the tree is loaded, its subtrees are added to the backlog and the tree node is sent via the outCh to the error checking. The child nodes are then loaded next from the end of the backlog, and so forth.
With multiple treeLoaders it gets a bit more chaotic, but the overall pattern still holds. The archiver code uses a post-order depth-first traversal, so the traversal order in the checker is still not a perfect fit, but at least it resembles the archiving process much more closely than the breadth-first search that was used previously.
Even though the checkTreeWorker skips already processed chunks, filterTrees did queue the same tree blob on every occurrence. This becomes a serious performance bottleneck for larger numbers of snapshots that cover mostly the same directories. Therefore, decode each tree blob exactly once.
Avoid duplicate allocation of the Subtree list.
The blobRefs map and the processedTrees IDSet are merged to reduce the memory usage. The blobRefs map now uses separate flags to track blob usage as data or tree blob. This prevents skipping of trees whose content is identical to an already processed data blob. A third flag tracks whether a blob exists or not, which removes the need for the blobs IDSet.
Backups traverse the file tree in depth-first order and save trees on the way back up. This results in tree packs filled in a way comparable to reverse Polish notation. In order to check tree blobs in that order, the treeFilter would have to delay forwarding tree nodes until all of their children were processed, which would complicate the implementation. Therefore do the next best thing and traverse the tree in depth-first order, but process trees already on the way down. The tree blob ids are added to the backlog in reverse order, which is reversed once again when the ids are removed from the back of the backlog.
This change only moves code around but does not result in any change in behavior.
The `DuplicateTree` flag is necessary to ensure that failures cannot be swallowed. The old checker implementation ignores errors from LoadTree if the corresponding tree was already checked.
If a data blob and a tree blob with the same ID (= same content) exist, then the checker did not report a data or tree blob as unused when the blob of the other type was still in use.
LGTM!
What is the purpose of this change? What does it change?
Running restic check for large repositories is currently really slow. I had one run for an 8 TB repository with roughly 40 million files that took approximately one week to complete (using GOMAXPROCS=1 to avoid interference with other processes on the server). This is partially caused by decoding, for each snapshot, every referenced tree blob. The integrity check later on checks a tree blob only a single time and ignores later repetitions. Decoding a JSON file is not particularly fast (and the necessary index lookups are even slower, see #2284).
This change ensures that tree blobs are only decoded once. As I had to modify the blobRefs map anyway, it is now also used to track whether a blob is listed in an index or not, thereby obsoleting the blob IDSet.
The checker now also traverses the tree blobs in depth-first order which better matches the traversal order of the backup code.
Was the change discussed in an issue or in the forum before?
The change is related but not identical to issue #2284 .
Checklist
- A changelog entry in changelog/unreleased/ that describes the changes for our users (template here)
- gofmt run on the code in all commits