It allows programs to be run in response to sysevents, some of which are generated by ZFS and some of which are generated by other parts of the system (e.g., device hotplug).
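For example, a minimal zedlet sketch, assuming OpenZFS's ZED (which runs executables from /etc/zfs/zed.d/ and exports ZEVENT_* environment variables; the filename below is made up and the exact variables available depend on the event class):

    #!/bin/sh
    # /etc/zfs/zed.d/all-log-to-syslog.sh (hypothetical name)
    # ZED invokes this for every sysevent and passes details via ZEVENT_* vars.
    logger -t zed "event ${ZEVENT_CLASS:-unknown} on pool ${ZEVENT_POOL:-n/a}"
    exit 0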
If you are on Linux I can highly recommend ZFS for local storage and Minio for S3-like storage.
Minio is barely documented. I had to ask in Slack to interpret what it meant when Minio said "your cluster is 5 red and 7 yellow", as the colours aren't even documented.
Every Minio cluster I hosted had data loss. Each one led to a reported issue on their GitHub that hasn't been closed to date. Nothing about recovery is documented. Documentation is slim in general. I'm really confused why you're naming it next to ZFS, which is sublimely documented and has withstood the test of time.
I'd advise something battletested through time like Ceph for object storage instead.
Even Ceph has its warts for distributed object storage, as does basically anything OSS-ish worth considering (GlusterFS, HDFS, Lustre), but comparing Minio to ZFS is confusing given how drastically the two projects differ in purpose, functionality, and engineering hardening. With that said, the AWS S3 team really is impressive in what they've built out and deserves more shout-outs from people outside Amazon.
However, I'm slightly concerned that we haven't tested it enough, and that doc and support may prove to be lacking (like in your case) when we hit edge cases and failure scenarios.
No LTS version. Bugs will be fixed in a new version, which may include new bugs. Some upgrades need a full cluster restart. If you ask too many questions you may get "it's better to start a subscription".
ZFS is fine, but it is overkill for most home applications and it has a pitfall related to extensibility.
So it really depends on your needs. Statements like "ZFS is the best filesystem" are meaningless.
P.S. SSD caching often has no tangible practical benefit for most applications. More RAM is often the better investment, as with any filesystem.
This is because ZFS is cached by the ARC, not the normal page cache. The ARC is weird, and operates on 8K blocks (like the SPARC page size), rather than 4K pages. Zero-copy paths like sendfile depend on referencing pages in the page cache, and have never been adapted to deal with the ARC. So making sendfile zero-copy with ZFS is a hard project that would involve either teaching the ARC to "loan" pages to sendfile, or ripping out ARC caching from ZFS and making it use the same page cache that all other filesystems use.
It all depends on the circumstances and requirements: (small) business application or some home-built NAS?
In your case, how much does it matter that some node experiences bitrot, and how big are those risks?
For our use, bit rot is pretty low risk. We have tooling to catch corrupted files (and it happens surprisingly rarely). We don't care about any of the raid like features (if a drive dies, we tell clients to get their video elsewhere).
For our use case, ZFS would be attractive mainly because of the ability to keep metadata in the L2ARC. One of our bigger sources of P99 latency is uncached metadata reads from mechanical drives. Our FS guys are currently solving that problem in other ways.
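For what it's worth, steering the L2ARC toward metadata only is a one-line property change in ZFS; this is just a sketch with a made-up dataset name, and I haven't measured it against your workload:

    # cache only metadata (not file data) in the L2ARC for this dataset
    zfs set secondarycache=metadata tank/videos
    # the in-RAM ARC is controlled the same way via primarycache
    zfs get primarycache,secondarycache tank/videos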
Kind of: their edge-cache appliances run FreeBSD. IIRC they run Linux on their Amazon cloud for all their 'internal' stuff.
If you do some searches there are some good presentations on their work on getting encrypted streaming to go Very Fast:
I started doing that because I saw corruptions on magnetic disks at home. Some files were silently corrupted. I had no redundancy. I didn't know which files were "good" either.
So now I have one multi-disk machine running FreeBSD with ZFS. It works well. The hardware isn't especially fancy. In the time since, I have seen it catch hardware failures. I have seen it call out specific files as corrupt. This is a huge improvement over how I saw bad disks surface in the past.
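For anyone curious what "calling out specific files" looks like, it's roughly this (pool name is a placeholder):

    # a scrub re-reads every block and verifies it against its checksum
    zpool scrub tank
    # -v lists permanent errors, including the affected file paths
    zpool status -v tank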
Almost none of them even have ECC memory.
If you care so much about data integrity, please start there.
If you are building a home NAS from quality rackmounted server parts, then maybe you are fine. But this was not an option for me, as I did not have dedicated server space and needed something quiet. And once you start messing with desktop cases full of hard drives, it is very easy to get corrupted data.
I run ZFS on my home NAS. Yes, it (probably) eats too much RAM, and it's (probably) not the fastest thing, but at least my data is intact. I once had to piece together my photo collection from multiple backups, and it was not fun at all.
A plain statement like this doesn't prove anything.
I had a few cases of data damage. One of the worst was when I moved to a different place and had to leave much of my stuff behind. I had half a dozen or so smaller drives in my PC (SATA + IDE) which were working just fine. I got about three new drives (I believe 1TB SATA?), installed them in the PC, and copied all the files to the new drives. I then left the old PC behind and only took the new drives with me.
This was Linux, ext3 and JBOD (no RAID or anything). I did not have a good filing system, so some data got copied multiple times.
Once I got into the new place, I bought a new PC and installed the hard drives I had. I noticed that some files were damaged. I had some checksumming scripts and was recording checksums, and found that some checksums would not match - and each copy had a different set of damaged files. I ended up cherry-picking files from multiple copies to assemble a good set.
I don't know the exact reason, but I am fairly sure they were transfer errors. The original PC was working fine for a long time, so source data was likely clean. The new PC did not show any more data corruptions, and it was reading the same data every time. So my theory is either transfer errors while copying files, or silent data corruption on disk.
I don't know of any solutions that would have helped here except custom data checksum tools or ZFS (I suppose btrfs might have helped too, but I heard too many horror stories about it).
I actually had this come up a second time: when I moved again and built another NAS box (desktop motherboard, 5x 4TB drives with ZFS), I started copying files off the old SATA drives (ext3) and saw the same data transfer mismatches. It was pretty freaky: rsync the file, flush caches, md5sum source and destination - and they are different. The kernel log was quiet and memtest showed no errors, so I got a beefier power supply and replaced all the SATA cables. This helped.
Yeah, not having silent data corruption is "overkill", sure. /s Why not use ZFS? It takes 15 seconds to install, and its CLI is fairly intuitive. Works fine. Costs $0. Why not, even for "home" applications?
I could see how it could be unsuitable for "enterprise" applications where there are strict performance requirements etc., but for home use, I wish I could use ZFS everywhere.
It’s way at the far end of the “must RTFM to use safely, and then probably brush up on it again before actually doing anything unless you use it daily” spectrum of intuitiveness.
I like my ZFS mass storage volumes for my home server. I worry I’ll screw them up and/or burn an hour googling and reading the manual every time I have to touch them, though.
At work, our servers will get into a state where they just hang for anywhere from minutes to hours while BTRFS does "something". I'm not one of the admins though, so I don't know the exact details. I just know that this is a vendor-supported configuration and the vendor has been unable to tell us why this happens or offer any solution that makes it not happen. Our answer to this issue has been to rebuild servers with ext4 when things get bad. This has happened on multiple servers hosting different applications - the only commonality seems to be that write-heavy loads get it into this state. Servers that just have their OS on BTRFS but do all of their work on NFS volumes are fine.
At home, I once rebooted my OpenSuse Tumbleweed laptop and ended up with a BTRFS filesystem that couldn't be mounted RW. Fortunately I was able to mount it RO after booting off installation media and copy my data off, but I couldn't get the filesystem back into a state where it could be mounted RW. I ended up reinstalling. I never did figure out the root cause, but I suspect that some BTRFS-related process was running when I rebooted.
On the flip side, ZFS has never let me down in this way, but to be fair I've never subjected it to the same use-cases. Unfortunately, the inability to resize/reshape the filesystem is an issue for me. I believe that it's being worked on, but I don't think that work is production-ready yet.
Last I checked BTRFS RAID5/6 was a dumpster fire and unusable in production. Have they actually open sourced the ability to fix bitrot detection with mdraid? If not, it's kind of irrelevant.
So... once again downvotes without a response - BTRFS RAID still isn't recommended, and the file healing isn't compatible with MDRAID I assume, and you just don't like the fact I pointed it out? The "I'm downvoting because you pointed out a flaw in my logic" attitude at HN is disappointing.
This is strictly speaking about how it deals with data and parity; the implementations are obviously different.
RAID-1 would be a mirror in ZFS parlance.
Even if it has a GUI, it's probably a non-starter unless there are literally two options.
You can check the status of each feature here: https://btrfs.wiki.kernel.org/index.php/Status
None of the options are marked experimental. (Specific features are marked unstable on the status page)
"Under heavy development" does not mean anything about stability. The kernel itself is under heavy development. ZoL is under heavy development. The disk format is stable, which is what matters.
SUSE provides commercial support for btrfs and uses it as the default. That's pretty much as non-experimental as it gets.
Update - it also appears SUSE Enterprise uses btrfs for the root OS filesystem but XFS for everything else, including /home by default. To me, this seems telling? If it's so solid, why not use it for /home?
The legal position is that ZFS is open source under a copyleft license but that many people think that it's illegal to bundle it with the Linux kernel because of some (I think unintended) incompatibilities between its license and the Linux kernel's license. Canonical (and some others) disagree, and think that it's legal. It's only the bundling that's at issue - everyone agrees that it's legal to use with Linux once you have both.
See https://ubuntu.com/blog/zfs-licensing-and-linux for Canonical's opinion.
Here is some more info on the subject: https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/
Nope. Simon Phipps, Sun's Chief Open Source Officer at the time, and the boss of Danese Cooper—who is the source of the claims it was intended—has stated it was not:
Bryan Cantrill, the co-author of DTrace, has also stated that they were expecting the Linux folks to potentially incorporate their work:
Do you have any citations that can corroborate the/your claim that incompatibility was intended?
> Here is some more info on the subject:
I think there's also a general suspicion that Sun could have just chosen the GPL if they cared about compatibility. Although, for various reasons, it's probably at least somewhat more complicated than that because of patent protection, etc.
There were 'technical' reasons why they did not go with the GPL, and specifically GPLv2 (GPLv3 was not out yet). IIRC, they did consider waiting for GPLv3, but it was unknown when it would be out, and one thing they desired was a patent grant, which v2 does not have.
Another condition was that they wanted a file-based copyright rather than a work-based copyright (i.e., applies to any individual files of ZFS as opposed to "ZFS" in aggregate).
That's a very loaded statement. I've seen it said quite a lot over the years. But, have you thought about its implications?
The implicit assumption here is the primacy of the GPL over all other open source licences. Why should other companies and organisations treat it as "more special" than any other free/open source licence when it comes down to interoperability?
When it comes down to compatibility, the GPL is one of the last licences you should choose. Because by its very nature it is deliberately and intentionally incompatible with everything other than the most permissive licences. The problem with "viral" licences like the GPL is that "there can only be one" because they are mutually incompatible by nature. Why should the MPL/Apache/CDDL licences make special exemptions to lessen their requirements so that they can be GPL-compatible?
Nit: Apache 2.0 is compatible with GPLv3 (but not v2).
How would it work anyway? It's not CDDL or MPL that causes the incompatibility, it's the GPL.
"This means a module covered by the GPL and a module covered by the CDDL cannot legally be linked together."
I don't think the incompatibilities are really unintended. If anyone has enough lawyers and money to relicense something correctly, it's Oracle.
ZFS on Linux does not break the intention and spirit of the kernel licence.
Unless you're OK with data loss in the case of a power outage, you'll want to use a mirror for the SLOG at the very least. You can do that by making a mirror from two partitions on separate SSDs and then adding that as the SLOG. The partitions do not have to be very large, just enough to handle a minute or so of writes, so 10GB or so is often plenty.
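Roughly like this, assuming two SSDs each with a small spare partition (device names are placeholders):

    # build the SLOG as a mirror of two ~10GB partitions on separate SSDs
    zpool add tank log mirror /dev/nvme0n1p4 /dev/nvme1n1p4
    # the log vdev should now show up under "logs" in the pool layout
    zpool status tank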
Also keep in mind that ZFS does a lot of shuffling for the L2ARC. I had <1% L2ARC hits on my pool with a 128GB L2ARC partition, but almost a TB of writes per day to the L2ARC due to ZFS rotating data in and out of it.
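If the churn bothers you, the L2ARC fill rate is tunable; on ZFS-on-Linux it's a module parameter (value in bytes/second, default 8 MiB/s), though I can't promise it improves hit rates:

    # cap L2ARC writes at ~2 MiB/s on the running system
    echo 2097152 > /sys/module/zfs/parameters/l2arc_write_max
    # persist the setting across reboots
    echo "options zfs l2arc_write_max=2097152" >> /etc/modprobe.d/zfs.conf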
Of course it isn't. There can't be a single best filesystem. For the use cases ZFS targets, yes, it is the best filesystem.
It's been that way for years throughout various ZFS versions, and it's driving me crazy.
Are there any simple-to-use tools that automatically compare the same files from different sources and tell me if they are different? I have multiple copies of the data, but not knowing which file is corrupted and working through it is a pain in the bottom.
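Not a polished tool, but if the copies share the same directory layout, something like this sketch (paths are placeholders) will point at the mismatches:

    # record checksums from one copy...
    cd /mnt/copyA && find . -type f -print0 | xargs -0 md5sum > /tmp/copyA.md5
    # ...then verify the other copy against them; only failures are printed
    cd /mnt/copyB && md5sum --quiet -c /tmp/copyA.md5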
(Dell has equivalent products: For this discussion, iDRAC is the equivalent of iLO, and PERC is the equivalent to Smart Array.)
With those products, the RAID controller (Smart Array or PERC) will be connected to internal and/or external drives, will handle RAID in hardware (ideally with a battery backup write cache), and (through the iLO or iDRAC) generate alerts when a drive fails (or is close to failure).
In the context of ZFS, you don't have that. Your drive controller is either on-motherboard, or a PCIe card like (for example) the (Broadcom) LSI SAS 9300-8e. Those cards do have a RAID option (MegaRAID), but they are often used without.
The rest of the ZFS storage setup is pretty similar to the setup you are used to: Internal drives will have a SAS expander (if needed) on the motherboard, or will use a SAS expander card (for example, an HPE part #870549-B21 SAS expander card). External drives will be in a JBOD that has one or two expanders, and which is connected back to the server using SAS cables. One ZFS difference is that if you have many JBODs, instead of daisy-chaining arrays, you might choose to use a SAS switch (for example, an A54812-SW-01).
With all that I described with ZFS, I haven't mentioned how RAID is handled: ZFS handles RAID in software. RAID-Z1 is equivalent to RAID 5, RAID-Z2 to RAID 6, and RAID-Z3 is triple parity (no common RAID-level equivalent). ZFS also supports RAID 1 (two-drive mirrors), as well as a RAID 10 equivalent (striping across mirrored pairs of drives).
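For concreteness, those layouts map onto zpool create roughly like this (pool and disk names are placeholders):

    # RAID 6 equivalent: a single raidz2 vdev across six disks
    zpool create tank raidz2 sda sdb sdc sdd sde sdf
    # RAID 10 equivalent: stripe across two mirrored pairs
    zpool create tank mirror sda sdb mirror sdc sdd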
Since RAID is handled in software, and with the physical equipment I described, it is left to the OS to handle almost all monitoring and alerting. The one exception is that JBODs and rack-mount SAS switches (like the Astek) often have an Ethernet connection for monitoring and (basic) hardware control. But even that can often be handled within the OS, using SCSI Enclosure Services (SES, where the enclosure/switch itself is a device the OS can see and query).
I did briefly try ZFS on my laptop a year or so ago, and it ate up half of my RAM permanently. Since RAM was already fairly limited, that wasn't a sacrifice I was willing to make, especially when I have plenty of backups anyway.
On your home desktop, you don't have to run dedup. You will still get the bitrot protection.
I hate the cargoculting on this fucking site.
On a desktop?
Chances are that you don't have many users saving the same or slightly modified versions of a file on the same storage. For a single person, it doesn't make much sense.
> "You can use ZFS just fine, just turn off one of its most useful features."
ZFS has many useful features. They come with a price though, because there's no free computation (see also the laws of thermodynamics). It is then a matter of deciding which features you want or need, and are willing to pay the price for.
You obviously are not willing to pay the price for dedup (lvmvdo asks a similar price, so it is not ZFS-specific), so why are you complaining that you cannot use it? ZFS still has many more useful features.
You also have another option: add RAM to your desktop. It is cheap. Then you will be able to use that one feature.
> I hate the cargoculting on this fucking site.
Sigh. I'm actually a btrfs fan; all my data is on a btrfs volume (at work we do use ZFS, so I do have the experience). But that doesn't mean I won't point out something that the other club does well.
Please explain where this number comes from. I run ZFS on boxes with as little as 4GB of RAM, which are also doing all sorts of other things in addition to ZFS.
As with all filesystems on Linux, more RAM means more cache, and if you have free RAM you will benefit from a dynamically resized filesystem cache. That RAM is however not required and the cache can be evicted under memory pressure.
ZFS will use all available RAM for caching, unless you tell it otherwise (zfs_arc_max or similar), but it should release it when the system requires it.
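Capping it looks roughly like this on Linux (value in bytes; a sketch only, adjust to taste):

    # limit the ARC to 2 GiB on the running system
    echo 2147483648 > /sys/module/zfs/parameters/zfs_arc_max
    # persist across reboots
    echo "options zfs zfs_arc_max=2147483648" >> /etc/modprobe.d/zfs.conf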
(if you don't have any files)
Doesn't do me much good with my 10TB of data.