Feature Request: copytruncate could make a CoW clone on supported filesystems. #538

Open
chder opened this issue Jul 20, 2023 · 7 comments

Comments

@chder

chder commented Jul 20, 2023

I was basically looking for an option to do the same as cp --reflink src dst when both files are on the same supported filesystem.
This reduces the total free disk space required to duplicate the log before clearing the source file and speeds things up.
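
For illustration, here is a minimal Python sketch of what such a clone-with-fallback step could look like. It is not logrotate code: the function name clone_or_copy is hypothetical, and the FICLONE constant is the Linux ioctl number as encoded on common architectures such as x86-64 and arm64 (an assumption; some architectures encode ioctl numbers differently). When the filesystem cannot share extents, the sketch falls back to an ordinary copy.

```python
import fcntl
import os
import shutil

# FICLONE from <linux/fs.h>, i.e. _IOW(0x94, 9, int).
# Assumption: the usual Linux ioctl encoding (x86-64, arm64, ...).
FICLONE = 0x40049409


def clone_or_copy(src: str, dst: str) -> bool:
    """Copy src to dst, sharing extents when the filesystem allows it.

    Returns True if a reflink clone was made, False if a plain copy was used.
    The name and return convention are hypothetical, for illustration only.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        try:
            # Ask the kernel to make dst share all of src's data blocks.
            fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
            return True
        except OSError:
            # Cloning is not possible here (unsupported filesystem,
            # different filesystems, ...): fall back to copying the bytes.
            shutil.copyfileobj(fsrc, fdst)
            return False
```

Either way the destination ends up with the same content; only the clone path avoids writing a second copy of the data.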

On an xfs fs I ran a little test:

Copy with reflink set

[root@rocky9-020 log]# cp --reflink messages messages-clone
[root@rocky9-020 log]# ls -al messages{,-clone}
-rw-------. 1 root root 71615 Jul 20 23:44 messages
-rw-------. 1 root root 71615 Jul 20 23:49 messages-clone
[root@rocky9-020 log]#

Shows that the extents are shared:

[root@rocky9-020 log]# filefrag -e -k messages{,-clone}
Filesystem type is: 58465342
File size of messages is 71615 (72 blocks of 1024 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:      45144..     45147:      4:             shared
   1:        4..       7:      45152..     45155:      4:      45148: shared
   2:        8..      11:      45160..     45163:      4:      45156: shared
   3:       12..      15:      45168..     45171:      4:      45164: shared
   4:       16..      23:      45176..     45183:      8:      45172: shared
   5:       24..      27:      45188..     45191:      4:      45184: shared
   6:       28..      35:      45196..     45203:      8:      45192: shared
   7:       36..      47:      45212..     45223:     12:      45204: shared
   8:       48..      51:      45228..     45231:      4:      45224: shared
   9:       52..      55:      45236..     45239:      4:      45232: shared
  10:       56..      63:      45244..     45251:      8:      45240: shared
  11:       64..      67:      45256..     45259:      4:      45252: shared
  12:       68..      71:      48012..     48015:      4:      45260: last,shared,eof
messages: 13 extents found
File size of messages-clone is 71615 (72 blocks of 1024 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:      45144..     45147:      4:             shared
   1:        4..       7:      45152..     45155:      4:      45148: shared
   2:        8..      11:      45160..     45163:      4:      45156: shared
   3:       12..      15:      45168..     45171:      4:      45164: shared
   4:       16..      23:      45176..     45183:      8:      45172: shared
   5:       24..      27:      45188..     45191:      4:      45184: shared
   6:       28..      35:      45196..     45203:      8:      45192: shared
   7:       36..      47:      45212..     45223:     12:      45204: shared
   8:       48..      51:      45228..     45231:      4:      45224: shared
   9:       52..      55:      45236..     45239:      4:      45232: shared
  10:       56..      63:      45244..     45251:      8:      45240: shared
  11:       64..      67:      45256..     45259:      4:      45252: shared
  12:       68..      71:      48012..     48015:      4:      45260: last,shared,eof
messages-clone: 13 extents found

And then no longer shared after truncating the original file:

[root@rocky9-020 log]# filefrag -e -k messages{,-clone}
Filesystem type is: 58465342
File size of messages is 0 (0 blocks of 1024 bytes)
messages: 0 extents found
File size of messages-clone is 71615 (72 blocks of 1024 bytes)
 ext:     logical_offset:        physical_offset: length:   expected: flags:
   0:        0..       3:      45144..     45147:      4:            
   1:        4..       7:      45152..     45155:      4:      45148:
   2:        8..      11:      45160..     45163:      4:      45156:
   3:       12..      15:      45168..     45171:      4:      45164:
   4:       16..      23:      45176..     45183:      8:      45172:
   5:       24..      27:      45188..     45191:      4:      45184:
   6:       28..      35:      45196..     45203:      8:      45192:
   7:       36..      47:      45212..     45223:     12:      45204:
   8:       48..      51:      45228..     45231:      4:      45224:
   9:       52..      55:      45236..     45239:      4:      45232:
  10:       56..      63:      45244..     45251:      8:      45240:
  11:       64..      67:      45256..     45259:      4:      45252:
  12:       68..      71:      48012..     48015:      4:      45260: last,eof
messages-clone: 13 extents found
[root@rocky9-020 log]#

@kdudka
Member

kdudka commented Jul 21, 2023

I think this may reduce the I/O transfer, but I do not think it could save any disk space in the long term. copytruncate truncates the original log file to zero after the copy, which frees the disk space again. Note that the implementation of copytruncate is racy and may cause data loss. There is a stale pull request to implement copyreduce with more useful semantics: #361
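
To make the race concrete, here is a small Python simulation (a sketch, not logrotate's implementation; the function name is hypothetical). A writer that keeps the log open appends a line after the copy has finished but before the truncate runs, so that line ends up in neither file.

```python
import os
import shutil


def copytruncate_with_late_write(log_path: str, rotated_path: str) -> bytes:
    """Simulate copy-then-truncate while a writer appends in between.

    Returns the bytes the writer appended inside the race window.
    """
    late = b"written between copy and truncate\n"
    # The daemon keeps the log open in append mode the whole time.
    writer = os.open(log_path, os.O_WRONLY | os.O_APPEND)
    try:
        shutil.copyfile(log_path, rotated_path)  # step 1: copy the log
        os.write(writer, late)                   # daemon logs a line here
        os.truncate(log_path, 0)                 # step 2: truncate the log
    finally:
        os.close(writer)
    return late
```

After the call, the rotated copy holds only what existed before the copy and the log itself is empty, so the late line is lost. A faster copy step narrows this window but does not close it.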

@chder
Author

chder commented Jul 21, 2023

The disk space saving I mentioned is just for during the copy.

My scenario is a system that's near full and a large log file that does ultimately compress well.
If the log is 1000M and there's only 500M remaining, the copy part of copytruncate fails, even if the end result would have compressed well enough to still fit in that 500M space.

Using CoW/reflink might pair well with that copyreduce mode too. That pull request addresses the gap in the time it takes to copy by precisely punching holes in the original file, but it still needs enough space and time for a full duplicate of the original log.

@Forza-tng

> I think this may reduce the I/O transfer, but I do not think it could save any disk space in the long term. copytruncate truncates the original log file to zero after the copy, which frees the disk space again.

Indeed. The advantage of a reflink copy on Btrfs, XFS (and soon ZFS?) is that it is near instant, even on very large files. It also reduces the amount of writes needed, even temporarily, while compressing the files (if using compress or delaycompress).

> Note that the implementation of copytruncate is racy and may cause data loss.

Unfortunate as it is, we can't always make sure loggers have good reload functions.

> There is a stale pull request to implement copyreduce with more useful semantics: #361

How about using the renameat2() system call to swap out the logfile with an empty one? I believe it is atomic, so it should minimise the risk of data loss.

> renameat2() RENAME_EXCHANGE
>
> Atomically exchange oldpath and newpath. Both path names must exist but may be of different types (e.g., one could be a non-empty directory and the other a symbolic link).

@kdudka
Member

kdudka commented Oct 18, 2023

@Forza-tng logrotate atomically renames the log file by default. The problem is that the daemon will continue to write to the already rotated log file as long as the original log file descriptor is kept open. On a POSIX system, a rename operation does not affect open file descriptors. And that is exactly the reason why the copytruncate directive is provided by logrotate.
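
The behaviour described above can be checked with a short Python sketch (the function name is hypothetical). A writer opens the log, the log is renamed, and the writer writes once more through its existing descriptor.

```python
import os


def rename_under_writer(log_path: str, rotated_path: str) -> tuple:
    """Rename the log while a writer holds it open, then write once more.

    Returns (inode behind the writer's fd, inode now at rotated_path,
    whether anything still exists at log_path).
    """
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND)
    try:
        os.rename(log_path, rotated_path)   # rotate by renaming
        os.write(fd, b"after rotation\n")   # goes through the old descriptor
        return (os.fstat(fd).st_ino,
                os.stat(rotated_path).st_ino,
                os.path.exists(log_path))
    finally:
        os.close(fd)
```

The first two inodes are equal and nothing exists at the old name, so the line written after the rename lands in the rotated file. The writer only starts using a fresh log once it reopens the path.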

@Forza-tng

> @Forza-tng logrotate atomically renames the log file by default. The problem is that the daemon will continue to write to the already rotated log file as long as the original log file descriptor is kept open. On a POSIX system, a rename operation does not affect open file descriptors. And that is exactly the reason why the copytruncate directive is provided by logrotate.

I might have misunderstood how renameat2() works, but the way I understand it, it keeps the inode of the original file, unlike when you use mv.

The process would be something like this:

  1. logfile_a with inode 123
  2. touch tmpfile
  3. renameat2() with RENAME_EXCHANGE to swap tmpfile and logfile_a
  4. mv tmpfile logfile_a.2

We should now have a new, empty logfile_a which still has inode 123. The logger's file descriptor remains unchanged and will continue to write to logfile_a.

If we use mv logfile_a logfile_a.2, the inode and fd would simply follow to logfile_a.2, which is the problem we're trying to solve.

@kdudka
Member

kdudka commented Oct 18, 2023

Nope. renameat2 changes the mapping of file names to inodes. You cannot use it to swap the content behind inodes.
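
A rough way to see this from Python is sketched below. It assumes Linux and a C library that exports a renameat2() wrapper (glibc 2.28 or later does); the constants are the Linux values, and the helper names exchange and swap_under_writer are hypothetical. If the call is unavailable or fails, the sketch reports that instead of guessing.

```python
import ctypes
import os

AT_FDCWD = -100       # Linux value: resolve paths relative to the cwd
RENAME_EXCHANGE = 2   # Linux value from <linux/fs.h> (1 << 1)


def exchange(path_a: str, path_b: str) -> bool:
    """Atomically swap two names via renameat2(); False if that fails."""
    libc = ctypes.CDLL(None, use_errno=True)
    try:
        fn = libc.renameat2   # assumption: the libc exports this wrapper
    except AttributeError:
        return False
    fn.argtypes = [ctypes.c_int, ctypes.c_char_p,
                   ctypes.c_int, ctypes.c_char_p, ctypes.c_uint]
    fn.restype = ctypes.c_int
    rc = fn(AT_FDCWD, os.fsencode(path_a),
            AT_FDCWD, os.fsencode(path_b), RENAME_EXCHANGE)
    return rc == 0


def swap_under_writer(log_path: str, tmp_path: str):
    """Swap log_path with tmp_path while a writer holds the log open.

    Returns None if the swap could not be done, else a tuple
    (inode behind the writer's fd, inode at log_path, inode at tmp_path).
    """
    fd = os.open(log_path, os.O_WRONLY | os.O_APPEND)
    try:
        if not exchange(log_path, tmp_path):
            return None
        return (os.fstat(fd).st_ino,
                os.stat(log_path).st_ino,
                os.stat(tmp_path).st_ino)
    finally:
        os.close(fd)
```

When the swap succeeds, the writer's descriptor still refers to the original inode, which is now reachable under tmp_path, while log_path points at the other inode. Only the names moved; the open descriptor keeps following the old file.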

@Forza-tng

> Nope. renameat2 changes the mapping of file names to inodes. You cannot use it to swap the content behind inodes.

You are correct. I did some tests using https://gist.github.com/eatnumber1/f97ac7dad7b1f5a9721f#file-renameat2-x86_64 which prove your point. I'm sorry for the noise caused 🙏

The original idea using reflinks would still be very beneficial.
