Implementation of backup tree deletion #946

Merged: 3 commits into master from dev/bar-181 on Jul 3, 2024

Conversation

@gustabowill (Contributor) commented Jun 21, 2024:

This PR got bigger than I expected so I will try to explain the changes per commit.

In 129c672 is the implementation for deleting descendant backup nodes when a parent is deleted. To achieve this, I divided the current delete method of the Server into two methods:

  • delete, which now performs some validations to make sure the backup is deletable (this used to be done later in the backup manager) and passes the whole chain for deletion inside a server-level lock;
  • perform_delete, which does some of the same work the previous delete method was doing, i.e. validating file permissions, holding the backup-level lock and calling BackupManager.delete. One thing to notice here is that this code used to acquire a server lock at the beginning, but it was released before the deletion even started. I also think the error message there was not entirely correct, as it would say something like "there's another process for this backup id" when the lock was actually on the server, not on the backup. I "fixed" those things, but would like a check-up just to make sure. A rough sketch of the split follows right after this list.
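A minimal, self-contained sketch of that split. The lock context managers, the validation calls and the manager call below are stand-ins rather than the real Barman API; only walk_backups_tree and the method names above come from this PR:

```python
# Toy model of the delete / perform_delete split described above.
# Locks, validations and the manager call are stand-ins, not the real Barman API.
from contextlib import contextmanager


@contextmanager
def server_lock(server_name):
    # Held once, around the deletion of the whole backup chain.
    yield


@contextmanager
def backup_lock(backup_id):
    # Held per backup, only while that single backup is being deleted.
    yield


def delete(server, backup):
    """Validate up front, then pass the whole chain for deletion under a server lock."""
    if not server.is_deletable(backup):  # stand-in for keep/redundancy checks moved up from the manager
        return False
    with server_lock(server.name):
        for node in backup.walk_backups_tree():  # descendants first, the requested backup last
            perform_delete(server, node)
        # How failures inside this loop should be handled is discussed in the review below.
    return True


def perform_delete(server, backup):
    """Delete a single backup: validate file permissions, take its lock, call the manager."""
    server.check_backup_files_writable(backup)  # stand-in for the permission validation
    with backup_lock(backup.backup_id):
        return server.backup_manager.delete_backup(backup)
```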

In b0f38be I fixed the tests that were broken due to moving validations and locks around. I had to create new tests in the server logic as well as change some existing ones and remove some code from the backup manager tests.

In b41d409 I created new tests for the children deletion logic. It basically tests that the whole chain is being passed for deletion and also that deleting a child removes its reference from its parent. A rough illustration of such tests follows below.
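For illustration only, such tests could look roughly like this (pytest style). The delete_tree helper and the parent_backup_info attribute are assumptions made for this sketch; the real tests exercise the actual Server methods:

```python
# Stand-in for the server-side chain deletion under test; the real tests in
# this PR exercise the actual Server methods instead.
from unittest import mock


def delete_tree(server, backup):
    # Delete every node of the chain and drop each deleted child's id from
    # its parent's children_ids (parent_backup_info is an assumed attribute).
    for node in backup.walk_backups_tree():
        server.perform_delete_backup(node)
        parent = node.parent_backup_info
        if parent is not None:
            parent.children_ids.remove(node.backup_id)


def test_whole_chain_is_passed_for_deletion():
    grandchild, child, root = (mock.Mock(parent_backup_info=None) for _ in range(3))
    grandchild.parent_backup_info, child.parent_backup_info = child, root
    child.children_ids, root.children_ids = [grandchild.backup_id], [child.backup_id]
    root.walk_backups_tree.return_value = [grandchild, child, root]

    server = mock.Mock()
    delete_tree(server, root)

    deleted = [call.args[0] for call in server.perform_delete_backup.call_args_list]
    assert deleted == [grandchild, child, root]


def test_child_deletion_removes_reference_from_parent():
    root = mock.Mock(parent_backup_info=None)
    child = mock.Mock(parent_backup_info=root)
    root.children_ids = [child.backup_id]
    root.walk_backups_tree.return_value = [child, root]

    delete_tree(mock.Mock(), root)

    assert root.children_ids == []
```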

References: BAR-181

barman/server.py (outdated):
deleted = False
backups_to_delete = backup.walk_backups_tree()
for backup in backups_to_delete:
    deleted = self.perform_delete_backup(backup)
Contributor:

I need to take a closer look at this PR.
IAC, I wonder whether this deleted variable will hold only the value of the "root" backup deletion.
What should we do if any child fails to be deleted, e.g. if we can't acquire a lock? At first glance it seems we would have a leftover.

Contributor:

FWIW, by coincidence I hit an issue more or less like the one I brought up above. After that I was never able to run barman delete and had to physically remove the files from the barman_home/barman_server/base directory to recover from that:

(barman) [vagrant@barmandevhost barman]$ barman list-backup pg17
pg17 20240624T215009 'child2.1' - Mon Jun 24 19:50:10 2024 - Size: 35.7 MiB - WAL Size: 0 B
pg17 20240624T214947 'child2' - Mon Jun 24 19:49:48 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240624T214834 'child1.1.2' - Mon Jun 24 19:48:35 2024 - Size: 35.7 MiB - WAL Size: 112.0 MiB
pg17 20240624T214828 'child1.1.1' - Mon Jun 24 19:48:29 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240624T214819 'child1.1' - Mon Jun 24 19:48:21 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240624T214809 'child1' - Mon Jun 24 19:48:10 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240624T214757 'root_backup' - Mon Jun 24 19:48:00 2024 - Size: 355.7 MiB - WAL Size: 32.0 MiB
(barman) [vagrant@barmandevhost barman]$ barman delete pg17 root_backup
Backup 20240624T214757 has incremental backups which depent on it. Deleting all backups in the tree
Deleting backup 20240624T214828 for server pg17
Deleted backup 20240624T214828 (start time: Mon Jun 24 21:50:59 2024, elapsed time: less than one second)
Deleting backup 20240624T214834 for server pg17
Deleted backup 20240624T214834 (start time: Mon Jun 24 21:50:59 2024, elapsed time: less than one second)
Deleting backup 20240624T214819 for server pg17
Deleted backup 20240624T214819 (start time: Mon Jun 24 21:50:59 2024, elapsed time: less than one second)
Deleting backup 20240624T214809 for server pg17
Deleted backup 20240624T214809 (start time: Mon Jun 24 21:50:59 2024, elapsed time: less than one second)
Deleting backup 20240624T214843 for server pg17
EXCEPTION: Could not find backup_id 20240624T214843
See log file for more details.
(barman) [vagrant@barmandevhost barman]$ barman list-backup pg17
(barman) [vagrant@barmandevhost barman]$ barman list-backup pg17
pg17 20240624T215009 'child2.1' - Mon Jun 24 19:50:10 2024 - Size: 35.7 MiB - WAL Size: 0 B
pg17 20240624T214947 'child2' - Mon Jun 24 19:49:48 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240624T214757 'root_backup' - Mon Jun 24 19:48:00 2024 - Size: 355.7 MiB - WAL Size: 240.0 MiB
(barman) [vagrant@barmandevhost barman]$ barman delete pg17 root_backup
Backup 20240624T214757 has incremental backups which depent on it. Deleting all backups in the tree
Deleting backup 20240624T214809 for server pg17
EXCEPTION: Could not find backup_id 20240624T214809
See log file for more details.
(barman) [vagrant@barmandevhost barman]$ barman delete pg17 20240624T215009
Deleting backup 20240624T215009 for server pg17
Deleted backup 20240624T215009 (start time: Mon Jun 24 21:52:31 2024, elapsed time: less than one second)
(barman) [vagrant@barmandevhost barman]$ barman list-backup all
pg17 20240624T214947 'child2' - Mon Jun 24 19:49:48 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240624T214757 'root_backup' - Mon Jun 24 19:48:00 2024 - Size: 355.7 MiB - WAL Size: 240.0 MiB
(barman) [vagrant@barmandevhost barman]$ barman delete pg17 20240624T214947
Backup 20240624T214947 has incremental backups which depent on it. Deleting all backups in the tree
Deleting backup 20240624T214952 for server pg17
EXCEPTION: Could not find backup_id 20240624T214952
See log file for more details.
(barman) [vagrant@barmandevhost barman]$ barman list-backup all
pg17 20240624T214947 'child2' - Mon Jun 24 19:49:48 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240624T214757 'root_backup' - Mon Jun 24 19:48:00 2024 - Size: 355.7 MiB - WAL Size: 240.0 MiB
(barman) [vagrant@barmandevhost barman]$ barman delete pg17 20240624T214757
Backup 20240624T214757 has incremental backups which depent on it. Deleting all backups in the tree
Deleting backup 20240624T214809 for server pg17
EXCEPTION: Could not find backup_id 20240624T214809
See log file for more details.
(barman) [vagrant@barmandevhost barman]$ barman list-backup all
pg17 20240624T214947 'child2' - Mon Jun 24 19:49:48 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240624T214757 'root_backup' - Mon Jun 24 19:48:00 2024 - Size: 355.7 MiB - WAL Size: 240.0 MiB

Contributor:

Maybe we can treat this whole deletion of a chain as a single unit, something like transactions in postgres, all or nothing.

Contributor Author:

> I need to take a closer look at this PR. IAC, I wonder whether this deleted variable will hold only the value of the "root" backup deletion. What should we do if any child fails to be deleted, e.g. if we can't acquire a lock? At first glance it seems we would have a leftover.

That's something I was thinking about as well. I mean, the errors will be logged, so the user will know which backups couldn't be deleted and could potentially delete them manually. On the other hand, if we stop the loop in case any backup fails along the way, it would also require the user to manually fix whatever is wrong and try again. Both require manual intervention, but thinking more about it, maybe it's better to force the user to delete everything before deleting the parent to avoid leftovers, as you said.

@gustabowill (Contributor Author) commented Jun 24, 2024:

@barthisrael I think this error that you faced is because we are not updating the children_ids in the backup_info as children are deleted. So when you try to run the delete for the second time, it tries to find backups that were already deleted but are still referenced in the parent file. That's something important that I wasn't even remembering: updating the references as backups are deleted 😵

Contributor Author:

Pushed some changes to stop the process if an incremental fails to be deleted, and also a change to update the children_ids as they are deleted.
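Roughly, and reusing the stand-ins from the sketch in the PR description above, the revised loop could look like this. The parent_backup_info attribute and the save call are illustrative assumptions, not the exact Barman API:

```python
def delete(server, backup):
    """Revised sketch: stop at the first failed deletion and keep parent references in sync."""
    if not server.is_deletable(backup):
        return False
    with server_lock(server.name):
        for node in backup.walk_backups_tree():
            if not perform_delete(server, node):
                return False  # stop here instead of silently leaving leftovers behind
            parent = node.parent_backup_info  # assumed attribute name, for illustration only
            if parent is not None:
                parent.children_ids.remove(node.backup_id)
                server.backup_manager.save_backup_info(parent)  # stand-in: persist the updated backup.info
    return True
```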

Contributor:

> I need to take a closer look at this PR. IAC, I wonder whether this deleted variable will hold only the value of the "root" backup deletion. What should we do if any child fails to be deleted, e.g. if we can't acquire a lock? At first glance it seems we would have a leftover.

The lock issue you raised is interesting, and I think that we should look into the locking sequence and be sure that a lock is acquired and kept for the whole chain-delete action, avoiding other interactions from external processes.
I'm just thinking out loud here, btw.

Contributor:

> Maybe we can treat this whole deletion of a chain as a single unit, something like transactions in postgres, all or nothing.

This is potentially a good idea if we can implement something like a dry run of the deletion of the whole chain...
Otherwise it can potentially become something that is going to consume a lot of disk space...
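For what it's worth, a hedged sketch of how such a dry run could look: walk the whole chain first and refuse to start deleting unless every node looks deletable. The check methods are stand-ins, and a pre-flight pass still leaves a window between the check and the actual delete, so this approximates all-or-nothing rather than being a real transaction:

```python
def chain_delete_problems(server, backup):
    # Pre-flight pass: collect everything that would prevent deleting the chain.
    problems = []
    for node in backup.walk_backups_tree():
        if server.is_backup_locked(node):   # stand-in: another process holds this backup
            problems.append(f"{node.backup_id}: currently locked by another process")
        if server.is_backup_kept(node):     # stand-in: a keep annotation forbids deletion
            problems.append(f"{node.backup_id}: marked as KEEP")
    return problems


def delete_chain_all_or_nothing(server, backup):
    problems = chain_delete_problems(server, backup)
    if problems:
        for problem in problems:
            print(f"cannot delete chain: {problem}")
        return False
    # Only now start the destructive pass over the chain.
    for node in backup.walk_backups_tree():
        server.perform_delete_backup(node)
    return True
```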

@gustabowill (Contributor Author) commented Jun 25, 2024:

> The lock issue you raised is interesting, and I think that we should look into the locking sequence and be sure that a lock is acquired and kept for the whole chain-delete action, avoiding other interactions from external processes. I'm just thinking out loud here, btw.

Do you think we need a lock on every single backup beforehand, or should a lock on the server be enough? I was thinking of acquiring a server lock at the start and then proceeding to acquire backup locks on demand as incrementals are being deleted.

As I see in the code, both the backup and the backup delete actions require a server lock acquisition, so neither of those actions will be able to run while the chain is being deleted. The server lock will protect the whole chain and the backup lock will protect each backup during its deletion. In the end I think that would serve the same purpose as holding all the locks at once, right? I don't know if I'm missing something.

What do you guys think? @gcalacoci @barthisrael @andremagui

Contributor:

I would like to know more about locking to make meaningful suggestions :D but I feel that acquiring just one lock would be enough. As it is, at the start of the delete_backup method in the server there is a check whether it's possible to acquire a server lock, which is then released; in fact, just one lock is created when deleting backups, right before the call to backup_manager. So does it make sense to have one lock for all deletions, or a lock for each deletion?

Sorry, instead of helping I just added more questions!!!!

@gustabowill force-pushed the dev/bar-181 branch 3 times, most recently from 9d5eba6 to ce1b93e on June 25, 2024 01:27
@barthisrael (Contributor) left a comment:

The changes introduced here look good and are aligned with what we discussed during our call.

I think we are only missing the "general server lock" when deleting the backups. Other than that we are doing what is expected:

  • Delete all backups in the tree, updating the references between nodes along the way;
  • If an issue is faced while deleting a backup of the tree, stop and return an error.

@gustabowill force-pushed the dev/bar-181 branch 4 times, most recently from 6a732c0 to 860d7dd on June 27, 2024 16:15
@gustabowill marked this pull request as ready for review on June 27, 2024 16:17
@gustabowill requested a review from a team as a code owner on June 27, 2024 16:17
@gustabowill changed the title from "Initial imlementation of backup tree deletion" to "Implementation of backup tree deletion" on Jun 27, 2024
@barthisrael (Contributor):

This seems to be working fine:

  • My backups (you can guess the relationship based on their names 😆 ):
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
pg17 20240628T000524 'another_incremental' - Thu Jun 27 22:05:26 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000511 'another_full' - Thu Jun 27 22:05:13 2024 - Size: 251.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000443 'child_2_1' - Thu Jun 27 22:04:44 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000436 'child_1_2' - Thu Jun 27 22:04:37 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000433 'child_1_1' - Thu Jun 27 22:04:34 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000419 'child_2' - Thu Jun 27 22:04:20 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000416 'child_1' - Thu Jun 27 22:04:17 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000321 'full_1' - Thu Jun 27 22:03:25 2024 - Size: 251.7 MiB - WAL Size: 96.0 MiB
  • Backup deletion starting from a node in the middle of the tree:
$ barman delete pg17 child_2
WARNING: Backup 20240628T000419 has incremental backups which depend on it. Deleting all backups in the tree
Deleting backup 20240628T000443 for server pg17
Deleted backup 20240628T000443 (start time: Fri Jun 28 00:08:05 2024, elapsed time: less than one second)
Deleting backup 20240628T000419 for server pg17
Deleted backup 20240628T000419 (start time: Fri Jun 28 00:08:05 2024, elapsed time: less than one second)
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
pg17 20240628T000524 'another_incremental' - Thu Jun 27 22:05:26 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000511 'another_full' - Thu Jun 27 22:05:13 2024 - Size: 251.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000436 'child_1_2' - Thu Jun 27 22:04:37 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240628T000433 'child_1_1' - Thu Jun 27 22:04:34 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000416 'child_1' - Thu Jun 27 22:04:17 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240628T000321 'full_1' - Thu Jun 27 22:03:25 2024 - Size: 251.7 MiB - WAL Size: 96.0 MiB
  • Backup deletion starting from a leaf in the tree:
$ barman delete pg17 child_1_2
Deleting backup 20240628T000436 for server pg17
Deleted backup 20240628T000436 (start time: Fri Jun 28 00:08:58 2024, elapsed time: less than one second)
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
pg17 20240628T000524 'another_incremental' - Thu Jun 27 22:05:26 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000511 'another_full' - Thu Jun 27 22:05:13 2024 - Size: 251.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000433 'child_1_1' - Thu Jun 27 22:04:34 2024 - Size: 35.7 MiB - WAL Size: 96.0 MiB
pg17 20240628T000416 'child_1' - Thu Jun 27 22:04:17 2024 - Size: 35.7 MiB - WAL Size: 64.0 MiB
pg17 20240628T000321 'full_1' - Thu Jun 27 22:03:25 2024 - Size: 251.7 MiB - WAL Size: 96.0 MiB
  • Backup deletion starting from the root:
$ barman delete pg17 full_1
WARNING: Backup 20240628T000321 has incremental backups which depend on it. Deleting all backups in the tree
Deleting backup 20240628T000433 for server pg17
Deleted backup 20240628T000433 (start time: Fri Jun 28 00:09:34 2024, elapsed time: less than one second)
Deleting backup 20240628T000416 for server pg17
Deleted backup 20240628T000416 (start time: Fri Jun 28 00:09:34 2024, elapsed time: less than one second)
Deleting backup 20240628T000321 for server pg17
Delete associated WAL segments:
	000000010000000300000039
	00000001000000030000003A
	00000001000000030000003B
	00000001000000030000003C
	00000001000000030000003D
	00000001000000030000003E
	00000001000000030000003F
	000000010000000300000040
	000000010000000300000041
	000000010000000300000042
	000000010000000300000043
	000000010000000300000044
	000000010000000300000045
	000000010000000300000046
	000000010000000300000047
	000000010000000300000048
Deleted backup 20240628T000321 (start time: Fri Jun 28 00:09:34 2024, elapsed time: less than one second)
$ barman list-backups pg17
pg17 20240628T000532 'yet_another_full' - Thu Jun 27 22:05:34 2024 - Size: 251.7 MiB - WAL Size: 0 B - WAITING_FOR_WALS
pg17 20240628T000524 'another_incremental' - Thu Jun 27 22:05:26 2024 - Size: 35.7 MiB - WAL Size: 32.0 MiB
pg17 20240628T000511 'another_full' - Thu Jun 27 22:05:13 2024 - Size: 251.7 MiB - WAL Size: 32.0 MiB

@barthisrael (Contributor) left a comment:

Good work here!

The implementation is clean, and seems to be working fine.

Now the usual nitpicking -- otherwise I wouldn't be Israel 😆

  1. I put a couple of suggestions regarding docstrings of delete_backup and perform_delete_backup
  2. Let's add References: BAR-181 to the PR description
  3. Let's add References: BAR-181 to the 3 commits
  4. Let's modify the first commit message a bit, just so we explain the "why" behind the following couple of changes:

1 - Some validations (check keep, check redundancy) were
moved from the backup manager to the server as they now will
need to happen in the very first stage of the deletion.

3 - The whole chain deletion will happen inside a server lock and
every individual backup deletion will happen inside backup lock.

Point 1 would make it easier for us to understand the code if we don't look at it for a while.
Points 2, 3 and 4 are relevant if we need to inspect the commit log some day in the future.

When a postgres full backup with incrementals is deleted
we have to make sure that all its descendants are deleted
along with it. Similarly, when a child backup is deleted
we have to make sure that its reference is removed from
its parent. To achieve this, the following changes were made:

1 - Some validations (check keep and check redundancy) were
moved from the backup manager to the server as they now will
need to happen in the very first stage of the deletion, which
happens in the server context. This is required because we have
to make sure that the parent itself can be deleted before proceeding
with the deletion of any of its children.

2 - Split the 'delete' method of the server into two separate methods so
it gets easier to distinguish the prepare phase (where children are gathered) and
the actual delete phase (where the backup manager is called).

3 - The whole chain deletion will happen inside a server lock and every
individual backup deletion will happen inside a backup lock. This ensures that
the whole process is done consistently, thus avoiding errors in the
middle which would require the user to try again.

References: BAR-181

Signed-off-by: Gustavo William <[email protected]>
As some validations had to be moved from the backup manager to the
server, some tests which were making use of those validations also
needed to change accordingly. Tests involving the lock phase
also changed, as the locking mechanism changed slightly.

References: BAR-181

Signed-off-by: Gustavo William <[email protected]>
Add tests for the logic of deleting child backups
when a parent is deleted. A new test was added to ensure
that the children are indeed being deleted along with the parent.
An existing test was modified to make sure that, once a child has
been deleted, its reference is completely removed from its parent.

References: BAR-181

Signed-off-by: Gustavo William <[email protected]>
@gustabowill (Contributor Author):

@barthisrael Thanks, I completely forgot about the references 😆. Just pushed the changes you asked for.

@barthisrael (Contributor) left a comment:

Looks good!

@gcalacoci (Contributor) left a comment:

Let's wait for @martinmarques' review before merging, but in the meantime LGTM.

@barthisrael (Contributor):

A random thought that just came up: maybe we need to handle this also in Barman cloud?

@gcalacoci (Contributor):

> A random thought that just came up: maybe we need to handle this also in Barman cloud?

IMHO barman cloud and incremental is something that is going to come later. We first focus on local, and then add cloud. IIRC we discussed this with @martinmarques during the design.

@barthisrael (Contributor):

> IMHO barman cloud and incremental is something that is going to come later. We first focus on local, and then add cloud. IIRC we discussed this with @martinmarques during the design.

I really don't recall this specific detail, sorry. IAC, what you said makes sense to me, as we discussed implementing this in incremental steps.

@barthisrael merged commit 4ee8f74 into master on Jul 3, 2024
7 of 8 checks passed
@barthisrael deleted the dev/bar-181 branch on July 3, 2024 15:48