better documentation of what happens with Remove Snapshot? #875
I believe the links will continue to behave as you describe. In a traditional Unix filesystem, a file can be pointed to by multiple 'links' (a.k.a. 'directory entries', i.e. filenames), as long as there is at least one. When the number goes to zero, the file becomes 'deleted' (nothing points to the inode, so it is freed). |
For files that haven't changed, that makes sense. But, for the simplest case, what happens if there are two snapshots and I delete the first one? I assume in that case that all links from the second to the first are replaced by actually copying the files. Otherwise the second snapshot would be left with almost nothing! |
When a file is initially created it is effectively created as a hard link to the actual file contents. When another hard link to that file is created then it is just another hard link to the same contents. There is no difference between the first link and the second. So you can delete either one of them and the other will still be a link to the original contents. The contents will not be removed until both links have been deleted. |
@colinl so you are saying that the very first snapshot is itself made of hard links and the file contents are effectively separate? And that when any snapshot is deleted, BIT knows to check whether each deleted link is present in any other snapshot, and if so leave the contents alone but if not, delete the contents? |
@wolftune no, that is not really what I am saying, though the result is the same. What I described is how the Linux file system works; it has nothing to do with BIT. You can try it yourself: create a file, make a hard link to it, and delete the original, and the second one will still be there. All that happens when you delete the original is that you delete the original link to the file. In fact, when you create a hard link to an existing file, there is absolutely no difference between the original link and the new one (apart from being in different places, of course, and possibly having different names). There is no way to tell which one was created first, for example (as far as I know). |
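The "try it yourself" experiment above can be sketched in Python (whose `os.link` and `os.remove` wrap the underlying `link()` and `unlink()` system calls); the file names are invented for the example:

```python
import os
import tempfile

# Deleting the original name of a file does not touch its contents
# as long as another hard link remains.
workdir = tempfile.mkdtemp()
original = os.path.join(workdir, "original.txt")
hardlink = os.path.join(workdir, "hardlink.txt")

with open(original, "w") as f:
    f.write("file contents")

os.link(original, hardlink)  # a second directory entry for the same inode
os.remove(original)          # delete the *first* name

with open(hardlink) as f:
    survived = f.read()      # contents are still fully readable
```

The same result is obtained if the roles are swapped and `hardlink` is deleted instead: neither name is privileged.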
So the point is that the data sits on a storage device regardless of whether there are hard links, but the existence of any hard link on that device ensures that the file system will not overwrite the data, while once no hard links remain, the file system allows that data to be overwritten? |
@wolftune, I think you still have not fully got it. There is no such thing as a file without any hard links. When we talk about a file, what we really mean is a hard link to the file contents. When a file is initially created, the file contents are written and a hard link is made to those contents. When we talk about deleting a file, that is strictly a misuse of words: what we really mean is deleting a link to the file contents. If there is only one link to the file (as is normally the case), then the file contents will also be deleted (well, released actually: the area of disc is not overwritten, it is just made available for reuse). If there is more than one link, as is the case if a file is created and then a hard link made (a second snapshot, for example), then the second link will still exist, so the contents will not be deleted. |
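The bookkeeping described here is visible as the per-inode link count, exposed as `st_nlink` in `stat` results; a minimal sketch, with invented file names:

```python
import os
import tempfile

# Watch the filesystem's per-inode link counter as links come and go.
workdir = tempfile.mkdtemp()
a = os.path.join(workdir, "a")
b = os.path.join(workdir, "b")

with open(a, "w") as f:
    f.write("contents")

count_after_create = os.stat(a).st_nlink  # just the original name
os.link(a, b)
count_after_link = os.stat(a).st_nlink    # two names, one inode
os.remove(a)
count_after_delete = os.stat(b).st_nlink  # back down; contents still live
```

Only when that counter reaches zero does the filesystem release the contents for reuse.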
Right, you're confirming what I wrote above.
I didn't say that. I referenced "data" as in the binary bits on some storage media in some non-random state that captures the "contents" (to use your wording). So, to clarify finally (I hope): all the redundancy in BIT snapshots is in the form of duplicated hard links. The file system recognizes them all and knows not to release the relevant area of the storage medium unless all of the hard links are deleted. To reiterate: after a new snapshot is completed, it doesn't rely at all on any previous snapshot. I could even manually remove older snapshots outside of BIT. It's the file system that knows that some deletions of hard links do not release the storage area for rewrite, because the file system is aware of the remaining hard links in the new snapshot. Do I get it now? |
Sounds like you *do* get it.
An analogy I have found helpful is that of a classic (real-world) library,
with books on shelves by number, and indexed by a card catalog.
There can be any number of cards in the catalog (analogous to filenames in
directories) which 'point to' the same book on a shelf. e.g., one card is
alphabetic by author, another card may be in a subject-matter catalog,
another in a different subject matter catalog. All pointing to the same
book on the shelf.
As long as one of those cards remains, the book remains also. If/when the
last card pointing to a book is removed, then the librarian (the filesystem
algorithms) also discards the book, making the shelf space it occupied
available for use when a new book needs to be stored.
For index cards which point to the same book, they have no effect on one
another. New index cards can be added (e.g. if someone wanted to build a
card index based on the books' cover colors), and others removed (e.g. when
the book is deemed irrelevant to one of its subject indices), and doing so
alters neither the book nor the other index cards which point to it.
So, each time BiT asks rsync to add a new snapshot, rsync checks whether
each file has changed since the last snapshot. If there is no change, rsync
creates a new hard link to the existing copy (a very efficient action),
rather than transferring a new copy (and redundantly using new space for
it). After a while, there are links to that same file in a number of
snapshots made over time. (Again, we are presuming the file is unchanged
in this example.) So, removing the *first* link to the file (i.e. deleting
the original snapshot) doesn't affect the stored file at all; it only
reduces the count of links pointing to it.
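This link-if-unchanged strategy (what rsync's `--link-dest` option does) can be sketched in Python. This is a hypothetical illustration, not BIT's or rsync's actual code; the `make_snapshot` helper and all file names are invented for the example:

```python
import filecmp
import os
import shutil
import tempfile

def make_snapshot(source, new_snap, prev_snap=None):
    """Toy sketch of hard-link-based snapshots (flat directory only).

    For each source file: if an identical copy exists in the previous
    snapshot, hard-link to it (costing no extra space); otherwise
    copy the file for real.
    """
    os.makedirs(new_snap, exist_ok=True)
    for name in os.listdir(source):
        src = os.path.join(source, name)
        if not os.path.isfile(src):
            continue  # subdirectories omitted for brevity
        dst = os.path.join(new_snap, name)
        prev = os.path.join(prev_snap, name) if prev_snap else None
        if prev and os.path.isfile(prev) and filecmp.cmp(src, prev, shallow=False):
            os.link(prev, dst)      # unchanged: just another hard link
        else:
            shutil.copy2(src, dst)  # new or changed: a real copy

# Tiny usage example:
root = tempfile.mkdtemp()
source = os.path.join(root, "source")
os.makedirs(source)
with open(os.path.join(source, "doc.txt"), "w") as f:
    f.write("unchanged data")

snap1 = os.path.join(root, "snap1")
snap2 = os.path.join(root, "snap2")
make_snapshot(source, snap1)
make_snapshot(source, snap2, prev_snap=snap1)

# Both snapshots name the same inode for the unchanged file.
same_inode = (os.stat(os.path.join(snap1, "doc.txt")).st_ino
              == os.stat(os.path.join(snap2, "doc.txt")).st_ino)

shutil.rmtree(snap1)  # delete the *first* snapshot entirely...
with open(os.path.join(snap2, "doc.txt")) as f:
    survived = f.read()  # ...and the second snapshot is still complete
```

The last two steps illustrate the point of the paragraph above: removing the first snapshot only removes links, and the later snapshot remains a complete backup.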
Intuitively, we tend to think of the file as "living" in the folder of the
first snapshot (it took me a long time to get my head around this). But
once a second snapshot links to the same file, those links exist as
equals - neither one having any more 'ownership' of the file than the
other. The chronological order in which the links were created does not
matter for the continued presence of the file; all that is needed is that
at least one link to it remains in existence.
|
That is actually intuitive to me, and that's where I get worried. The librarian would have to be a superhuman computer to know, when removing a card, that it's the last card; there's nothing on the card itself to tell you that. I'm sure this is where something could go wrong, like a hard link to data across drives or something. In this case, the conclusion I draw is that the file system is that superhuman computer that actually does the intensive work of knowing whether a removed link is the last one. |
No analogy is perfect, and mine is no exception. Trust, but verify: this behaviour is hinted at in the POSIX API, where the function call to delete a file is called unlink(). |
Okay, so besides my edification, it seems reasonable to ask whether the BIT docs clarify that even though each snapshot is incremental (fast backups, only changed files need to be copied), each snapshot has no reliance on any other (or some similar clarification of the situation). It would be nice to avoid the worry I started with about such dependencies. |
It's easy to make the mistake of thinking that a file's identity is its path and name, particularly because that is actually true on FAT file systems. But on Linux/Unix file systems (also on NTFS, though you have to dig a lot to verify it), a file's identity is its inode, not its path and name. Those are just tools (called hard links) for finding the inode. Just as a screw doesn't care which of your two compatible screwdrivers you use on it, an inode doesn't care - or, for that matter, know - which hard link you use to find it. Nor do the hard links know anything about each other, not even that they exist, so they have no precedence. The inode does, however, know how many hard links point at it: a counter in the inode that is updated whenever a hard link is added or deleted. Note: symbolic links are different; I won't go into them here.
Now, when looking at what BIT does, you can choose to look at either the process (an incremental backup) or the result (a full backup). Both are correct, but if the backups run at a time of low or no other activity - or run very fast because they have very little to do - then I think the latter is more useful. |
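That a file's identity is its inode, not its name, can be observed directly: both hard links report the same inode number (`st_ino`), and the per-inode link counter (`st_nlink`) is visible through either name. A minimal sketch, with invented file names:

```python
import os
import tempfile

# Two hard links to the same contents share one inode; neither is
# "the original".
workdir = tempfile.mkdtemp()
first = os.path.join(workdir, "first")
second = os.path.join(workdir, "second")

with open(first, "w") as f:
    f.write("payload")

os.link(first, second)
inode_first = os.stat(first).st_ino    # the file's real identity
inode_second = os.stat(second).st_ino  # same inode, different name
links = os.stat(second).st_nlink       # the counter kept in the inode
```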
I've searched around but feel unsure still. I'd like to understand what happens if I remove snapshots 2, 4, and 6 from a list of snapshots 1 through 7. Somehow, all the links and whatever are readjusted fully so that each snapshot is guaranteed to be as if it really were done over the remaining next-older one? And I still have an effectively complete backup if I do that sort of removal and then make a new snapshot?