Bcachefs, an Introduction/Exploration (asleson.org)
147 points by marbu 1 day ago | 177 comments





Here’s my pet peeve regarding RAID: no RAID system I’ve ever used gracefully handles disks that come and go. Concretely: start with two disks in RAID1. Remove one. Mount in degraded mode. Write to a file. Unmount. Reconnect the removed disk. Mount again with both disks.

The results vary between annoying (need to restore / “resilver” and have no redundancy until it’s done; massively increased risk of data loss while doing so due to heavy IO load without redundancy and pointless loss of the redundancy that already exists) to catastrophic (outright corruption). The corollary is that RAID invariably works poorly with disks connected over an interface that enumerates slowly or unreliably.

Yet most competent active-active database systems have no problems with this scenario!

I would love to see a RAID system that thinks of disks as nodes, properly elects leaders, and can efficiently fast-forward a disk that’s behind. A pile of USB-connected drives would work perfectly, would come up when a quorum was reached, and would behave correctly when only a varying subset of disks is available. Bonus points for also being able to run an array that spans multiple computers efficiently, but that would just be icing on the cake.


> The results vary between annoying (need to restore / “resilver” and have no redundancy until it’s done; massively increased risk of data loss while doing so due to heavy IO load without redundancy and pointless loss of the redundancy that already exists) to catastrophic (outright corruption).

I'm not sure what you expect?

RAID1 is a simple data copy, and you deliberately made the two disks contain different data. So there are two possible outcomes: either the system notices this and copies A to B or B to A to re-establish the redundancy, or it fails to notice and you get corruption.

Linux MD allows for partial sync with the bitmap. If the system knows something in the first 5% of the disk changed, it can limit itself to only syncing that 5%.
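
A rough sketch of that flow with mdadm (array and partition names are placeholders):

    # enable an internal write-intent bitmap on an existing array
    mdadm --grow --bitmap=internal /dev/md0

    # when the missing disk reappears, re-add it; only dirty regions resync
    mdadm /dev/md0 --re-add /dev/sdb1

    # watch the (partial) resync
    cat /proc/mdstat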

> Yet most competent active-active database systems have no problems with this scenario!

Because they're not RAID. The whole point of RAID is that it's extremely simple. This means it's a brute force method with some downsides, but in exchange it's extremely easy to reason about.


I mean “RAID” in the more general sense, including btrfs, ZFS, etc, not just old-school RAID.

USB connected disks introduce new problems, like random disconnections.

RAID is overkill for home use. It also does not solve backups and snapshots. I use one-way syncthing with unlimited history, plus a USB-SATA adapter.


Beware, ZFS often hangs on USB disconnections, forcing a reboot:

https://github.com/openzfs/zfs/issues/3461


Yes, for home use I prefer multiple computers with a single disk each, each holding one copy of the data, over one machine with RAID.

That means you have no bitrot protection, in fact you’ve now increased that possibility.

What's your syncthing setup?

One-way sync to a couple of servers. Unlimited history in syncthing for backups.

Does ceph not fulfill your requirements here? Especially that last "spans multiple computers" bit.

Ceph doesn’t really nail the “I want to boot off this thing” use case. It would be interesting to try, though.

Ceph provides S3-compatible object store no? If so, just use s3backer[1] with a loopback mount and boot[2] off it?

I mean, it sounds like a house of cards, but it should be possible?

[1]: https://github.com/archiecobbs/s3backer

[2]: https://ersei.net/en/blog/fuse-root


I'd actually recommend instead going a bit further since it'll be more reliable and easier to setup on the client side. Use the iSCSI gateway and an RBD image instead. This'll get you the availability of Ceph and is much better supported than using FUSE or s3 for booting. You can even install windows on an iSCSI target and PXE boot it (disclaimer: i've only read about this being done, not actually done it) so that you don't need any local storage at all on the remote machine.

You'll still want a fast network (I'd recommend 10GbE on the server at least, and 2.5GbE on clients; 1GbE will work but you'll notice it bottleneck in bursts), but that won't be any different from any other network booting/rooting process

https://docs.ceph.com/en/latest/rbd/iscsi-overview/


Oh wow how ironic. I totally forgot about iSCSI, despite having used it against Ceph for testing in my home lab. Yeah, definitely go for that.

Or the S3-fuse route if you just want to geek flex.


If you're going that way, you'll be much, much happier with RBD.

https://docs.ceph.com/en/reef/rbd/
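
Roughly, the client side looks like this (pool/image names are made up, and it assumes the pool already exists):

    # create a 100 GiB image (size is in MB by default)
    rbd create rbd/rootvol --size 102400

    # map it through the kernel RBD client; it shows up as /dev/rbd0
    rbd map rbd/rootvol
    mkfs.ext4 /dev/rbd0
    mount /dev/rbd0 /mnt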


By “I want to boot off this thing” I mean that, if I have a computer with three disks, I want to run a normal-ish distro on that computer with those three disks possibly as /home or /var or maybe even as /.

I’m sure it’s possible. I don’t think this is quite what Ceph is intended for :)


ZFS will come closest.

I have a ZFS mirror, where I have taken one disk out, added files to it elsewhere, returned it and reimported.

The pool immediately resilvered the new content onto the untouched drive.

Doing this on btrfs will require a rebalance, forcing all data on the disks to be rewritten.


> Doing this on btrfs will require a rebalance, forcing all data on the disks to be rewritten.

I believe btrfs replace will copy only the data that had a replica on the failing drive.


A ZFS resilver is fast if there's not much changed data, only takes a few minutes

I didn’t know that — thanks!

> Error handling on CRC read error
> 2 or more copies of file, CRC on error, read other copy, data returned to userspace, does not correct bad copy

That's been implemented; in Linux 6.11 bcachefs will correct errors on read. See

> - Self healing on read IO/checksum error

in https://lore.kernel.org/linux-bcachefs/73rweeabpoypzqwyxa7hl...

Making it possible to scrub from userspace by walking and reading everything (tar -c /mnt/bcachefs >/dev/null).


Self healing is dangerous because it can potentially corrupt good data on disk, if RAM or other system component is flaky.

Repro: the supposedly only good copy is copied to RAM, RAM corrupts a bit, the CRC is recalculated using the corrupted bit, and the corrupted copy is written back to disk(s).


> crc is recalculated using corrupted bit

Why would it need to recalculate the CRC? The correct CRC (or other hash) for the data is already stored in the metadata trees; it's how it discovered that the data was corrupted in the first place. If it writes back corrupted data, it will be detected as corrupted again the next time.


Because CRC is in the on-disk data structure, not in the in-ram data structure. It is stripped upon reading to ram, and created upon writing to disk.

That's how bcachefs is designed right now.


That’s why you need ECC RAM.

Our RAM should all be ECC and our OSes should all be on self-healing filesystems.


The "why not btrfs" line boils down to "it took a long time to be stable".

That's a weird argument. Even if it's true, it is now stable, and has been for a long time. btrfs has long been my default, and I'd be wary of switching to something newer just because someone was mad that development took a long time.


In 2019, btrfs ate all my data after a power cut. Btrfs peeps said it sounded like my SSD was at fault. Well, ZFS is still chugging along on that drive. I am not surprised btrfs took ages to stabilize, and it will take ages again before I rely on it. I’ve had previous btrfs incidents too. I think the argument against btrfs is that it was not good enough when btrfs devs told people to use it in production for ages.

Anecdotally and absolutely not production experience here, but I've had a Synology device running btrfs for 7 or 8 years now. Only issue I ever had is when I shipped it cross country with the drives in it, but was able to recover just fine.

This includes plenty of random power losses.


They do use btrfs. However, Synology also uses some additional tools on top of btrfs. From what I remember (could be wrong about the precise details), they actually run mdadm on top of btrfs, and use mdadm in order to get the erasure coding and possibly the cache NVME disk too. (By erasure coding, I mean RAID 5/6, or SHR, which are still unstable generally in BTRFS).

I assume you mean running btrfs on top of md (mdadm) or dm (dmraid), not the other way around?

Woops, you are correct! And it looks like it is dmraid, not mdadm.

https://daltondur.st/syno_btrfs_1/

Sorry about that!


Yeah, in the last year and a half, I've had three btrfs file systems crash on me with the dreaded "parent transid verify failed". Two times it was out of the blue, third time was just after it filled up.

The people on IRC tend to default to "unless you're using an enterprise drive, it's probably buggy and doesn't respect write barriers", which shouldn't have mattered because there was no system crash involved.

Yes, I did test my RAM, I know it's fine. For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.


> Yes, I did test my RAM, I know it's fine. For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.

Just luck. Software can't defend itself against bad RAM. There's always the possibility that bad RAM will cause ZFS to corrupt itself in some way it can't recover itself from.

Everything is in RAM. The kernel, the ZFS code, everything. All of that is vulnerable to corruption. No matter how fancy ZFS is, it can't stop its own code from being corrupted. It's just luck that it didn't happen.


ECC RAM helps

> For comparison, I've (unintentionally) run a ZFS system with bad RAM for years and it only manifested as an occasional checksum error.

Be careful though. If whatever data was to be written got corrupted early enough, i.e. before ZFS got to see it, ZFS happily wrote corrupted data to disk with a matching checksum and you're none the wiser. But yes, it didn't blow up the entire filesystem the way btrfs likes to do.


Btrfs never actually stabilized; it's still garbage compared to ZFS

Care to substantiate that statement? It seems rather arbitrary to just say that it's garbage when it is running and has been running successfully for the vast majority of its users. It also offers two features that ZFS does not: the ability to grow a pool, and offline deduplication.

Based on the reports of corruption and data loss from actual users, I don’t think this claim is true at all.

Does it even have RAID5?

Why should it matter? It's an extremely niche technology that's only interesting to some home users. I see no reasons why other users should care about a RAID level they're not interested in.

(I don't use btrfs or any other COW filesystem because of significantly worse performance with some kinds of workloads, but it has nothing to do with maturity of any of them.)


> Why should it matter? It's an extremely niche technology that's only interesting to some home users.

I use RAID-Z2 in lots of places for bulk storage purposes (HPC).

There's a reason why Ceph added erasure coding:

* https://ceph.io/en/news/blog/2017/new-luminous-erasure-codin...

* https://docs.ceph.com/en/latest/rados/operations/erasure-cod...

When you're talking about PB of data, storage efficiencies add up.


Wtf? This is a bizarre take. Facebook poured millions of dollars of R&D into btrfs.

But they likely put money into the features they're interested in, not raid56.

Yes but you will lose data if you are writing to your array when the power goes out. RAIDZ (ZFS) does not have this problem. See BTRFS RAID5 write hole.

Btrfs is still far less reliable than ZFS after _decades_ of development. This is unacceptable, IMHO. I've lost so much data due to btrfs corruption issues that I've (almost) completely stopped using it nowadays. It's better to fight to keep the damned OpenZFS modules up to date and get an actually _reliable_ system instead of accepting the risk again.

> I've lost so much data due to btrfs corruption issues that I've (almost) completely stopped using it nowadays.

Just out of curiosity: is there a specific reason you're not using plain-vanilla filesystems which _are_ stable?

Personal anecdote: i've only ever had serious corruption twice, 20-ish years ago, once with XFS and once with ReiserFS, and have primarily used the extN family of filesystems for most of the past 30 years. A filesystem only has to go corrupt on me once before i stop using it.

Edit to add a caveat: though i find the ideas behind ZFS, btrfs, etc., fascinating, i have no personal need for them so have never used them on personal systems (but did use ZFS on corporate Solaris systems many years ago). ext4 has always served me well, and comes with none of the caveats i regularly read about for any of the more advanced filesystems. Similarly, i've never needed an LVM or any such complexity. As the age-old wisdom goes, "complexity is your enemy," and keeping to simple filesystem setups has always served my personal systems/LAN well. i've also never once seen someone recover from filesystem corruption in a RAID environment by simply swapping out a disk (there's always been much more work involved), so i've never bought into the "RAID is the solution" camp.


> Just out of curiosity: is there a specific reason you're not using plain-vanilla filesystems which _are_ stable?

I'd guess that it is the classic case of figuring out if something works without using it being a lot harder than giving it a go and seeing what happens. I've accidentally taken out my own home folder in the past with ill-advised setups and it is an educational experience. I wouldn't recommend it professionally, but I can see the joy in using something unusual on a personal system. Keep backups of anything you really can't afford to lose.

And one bad experience isn't enough to get a feel for how reliable something is. It is better to stick with it even if it fails once or twice.


> And one bad experience isn't enough to get a feel for how reliable something is.

For non-critical subsystems, sure, but certain critical infrastructure has to get it right every time or it's an abject failure (barring interference from random cosmic rays and similar levels of problems). Filesystems have been around for the better part of a century, so should fall into the category of "solved problem" by now. i don't doubt that advanced filesystems are stupendously complex, but i do doubt the _need_ for such complexity beyond the sheer joy of programming one.

> It is better to stick with it even if it fails once or twice.

Like a pacemaker or dialysis machine, one proverbial strike is all i can give a filesystem before i switch implementations.


snapshots every 15 minutes are a big selling point of ZFS for me; losing a file to a tired

    $ grep bar foo.txt | tr A-Z a-z > foo.txt
is much more common than losing a disk
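
And with those snapshots in place, getting the clobbered file back is just a copy out of the hidden snapshot directory (dataset and snapshot names are examples):

    # every snapshot is browsable read-only under <mountpoint>/.zfs/snapshot/
    ls /tank/home/.zfs/snapshot/
    cp /tank/home/.zfs/snapshot/auto-2024-06-01-1200/foo.txt foo.txt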

> losing a file to a tired ...

If the file isn't in source control, a backup, or auto-synced cloud storage, it can't be _that_ important. If it was in any of those, it could be recovered easily without replacing one's filesystem with one which needs hand-holding to keep it running. Shrug.


ZFS is the mechanism by which I implement local (via snapshots) and remote (via zfs send) backups on my user-facing machines.

- It can do 4x 15-minute snapshots, 24x hourly snapshots, 7x daily snapshots, 4x weekly snapshots, and 12x monthly snapshots, without making 51 copies of my files.

- Taking a snapshot has imperceptible performance impact.

- Snapshots are taken atomically.

- Snapshots can be booted from, if it's a system that's screwed up and not just one file.

- Snapshots can be accessed without disturbing the FS.

In my experience it hasn't required more hand-holding than ext4 past the initial install, but the OSes that most of my devices use either officially support ZFS or don't use package managers that will blindly upgrade a kernel past what out-of-tree modules I'm using will support, which I think fixes the most common issue people have with ZFS.
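
For the curious, the moving parts are tiny; this is roughly what the snapshot tooling does under the hood (dataset, snapshot, and host names are placeholders):

    # take an atomic, nearly-free snapshot
    zfs snapshot tank/home@2024-06-01_1200

    # list snapshots and the space each one pins
    zfs list -t snapshot -r tank/home

    # incremental replication to another machine
    zfs send -i tank/home@yesterday tank/home@today | \
        ssh backuphost zfs receive -F backup/home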


> is there a specific reason you're not using plain-vanilla filesystems which _are_ stable?

my personal reasons are raid + compression


Funny because I have the opposite experience. The main issue with btrfs is a lack of tooling that would let a layperson fix issues without btrfs-developer-level knowledge.

I've personally had drive failures, fs corruption due to power loss (which is not supposed to happen on a CoW filesystem), fs and file corruption due to RAM bitflips, etc. Every time, btrfs handled the situation perfectly, with the caveat that I needed help from the btrfs developers. And they were very helpful!

So yeah, btrfs has a bad rep, but it is not as bad as the common sentiment makes it out to be.

(note that I still run btrfs raid 1, as I did not find much real-world feedback regarding raid 5 or 6)


fs corruption due to power loss happens on ext4 because the default settings only journal metadata, for performance. I guess this is fine if everything is on batteries all the time, but it's intolerable on systems without a battery.

The FS should not be corrupted, only the contents of files that were written around the time of the power loss. Risking only file contents and not the FS itself is a tradeoff between performance and safety where you only get half of each. You can set it to full performance or full safety mode if you prefer.

True this is file corruption.

It's funny because Facebook uses btrfs for their systems & doesn't have these issues.

ZFS lovers need to stop this CoW against CoW violence.


Someone correct me if I'm wrong but to my understanding FB uses Btrfs in either RAID 0, 1, or 10 only and not any of the parity options.

RAID56 under Btrfs has some caveats but I'm not aware of any anecdata (or perhaps I'm just not searching hard enough) within the past few weeks or months about data loss when those caveats are taken into consideration.


> RAID56 under Btrfs has some caveats but I'm not aware of any anecdata (or perhaps I'm just not searching hard enough) within the past few weeks or months about data loss when those caveats are taken into consideration.

Yeah this is something that makes me consider trying raid56 on it. Though I don't have enough drives to dump my current data while re-making the array :D (perhaps this can be changed on the fly?)


What does your starting array look like? If you're already on Btrfs then I recall you could do something like `btrfs balance start -dconvert=raid6 -mconvert=raid1c3 /`

https://btrfs.readthedocs.io/en/latest/Balance.html
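
A hedged sketch of the full dance (device and mountpoint are placeholders); the conversion runs while the filesystem stays mounted:

    # add any new devices first
    btrfs device add /dev/sdX /mnt/pool

    # convert data to raid6 and metadata to raid1c3 in place
    btrfs balance start -dconvert=raid6 -mconvert=raid1c3 /mnt/pool

    # a full balance can take a long time; check on it with
    btrfs balance status /mnt/pool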


Yeah I'm on btrfs raid 1 currently, with 1x1TB + 2x3TB + 2x4TB drives. Gotta love btrfs's flexibility regarding drive size :D

I'll have a look, thanks! I guess failing this will make me test my backup strategy, which I have never tested in the past.


Out of curiosity, how much total storage do you get with that drive configuration? I've never tried "bundle of disks" mode with any file system because it's difficult to reason about how much disk space you end up with and what guarantees you have (although raid 1 should be straightforward, I suppose).

I get half of the raw capacity, so 7.5TB. Well a bit less due to metadata, 7.3TB as reported by df (6.9TiB).

For btrfs specifically there is an online calculator [1] that shows you the effective capacity for any arbitrary configuration. I use it whenever I add a drive to check whether it’s actually useful.

1: https://carfax.org.uk/btrfs-usage/?c=2&slo=1&shi=1&p=0&dg=1&...
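
On a filesystem that already exists, the same numbers can be read straight off the mount (mountpoint is an example):

    # per-profile breakdown of raw vs. usable space
    btrfs filesystem usage /mnt/pool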


> It's funny because Facebook uses btrfs for their systems & doesn't have these issues.

They likely have a distributed layer on top which takes care of data corruption and losses happening on a specific server.


>It's better to fight to keep the damned OpenZFS modules up to date and get an actual _reliable_ system

Try CachyOS (or at least the ZFS-Kernel) it has excellent ZFS integration.


This. I may still give up on running ZFS on Linux due to the common (seemingly intentional from the Linux side) breakage, but for my existing systems switching them over to CachyOS repos has been a blessed relief.

Well i use mainly FreeBSD but have used CachyOS for about 3mo to have some systemd refresher :)

Hadn't heard of CachyOS before, looks very nice! Was looking to move to Arch from KDE Neon, but this might be a much better fit.

Well or don't move from arch and just use the cachyos repos:

https://wiki.cachyos.org/de/cachyos_repositories/how_to_add_...

No reinstall needed ;)


Fair point. Currently running KDE Neon though, which is Debian based so reinstall needed...

For me the killer feature of btrfs is "RAID 1 with different sized disks". For a small and cheap setup, this is perfect since a broken disk can be replaced with a bigger one and immediately (part of) the extra new disk space can be used. Other filesystems seem to only increase the size once all disks have been replaced with a bigger capacity one (last time I checked this was still the case for ZFS)

Exactly. Provisioning a completely different set of disks when running out of capacity might be fine for a company but not for home office.

How does that work? You have two 100gb drives in raid1, 80% full, you replace one with a 200gb disk and write 50gb to the array - how is your 130gb of data protected against either drive failing?

I don't know the ins-and-outs of btrfs in detail, but having dug into other systems that offer redundancy-over-uneven-device-sizes and assuming btrfs is at least similar: with your two drive example you won't be able to write another 50gb to that array.

For two devices, 1x redundancy (so 2x copies of everything) will always limit your storage to the size of the smaller device, otherwise it is not possible to have two copies of everything you need to store. As soon as you add a third device of at least 100gb (or replace the 100gb device with one of at least 200gb) the other 100gb of your second device will immediately come into play.

Uneven device size support is most useful when:

- You have three or more devices, or plan to grow onto three or more from an initial pair.

- You want flexibility wrt array growth (support for uneven devices usually (but not always) comes with better support for dynamic array reshaping).

- You want better quick-repair flexibility: if a 4TB drive fails, you can replace it with 2x2TB if you don't have a working 4TB unit on hand.

- You want variable redundancy (support for uneven devices sometimes comes with support for variable redundancy: keeping 3+ copies of important data, or data you want to access fastest via striping of reads, 2 copies of other permanent data, and 1 copy of temporary storage, all in the same array). In this instance the “wasted” part of the 200gb drive in your example could be used for scratch data designated as not needing to be stored with redundancy.


It only works with 3+ disks. All data needs to be on two disks.

e.g. you have 3 100GB drives, total capacity in raid 1 is 150GB.

If you replace a broken one with a 200GB one, the total capacity will be increased to 200GB.


My understanding is that RAID1 is just a mirror, and all disks have identical contents. Are you talking about something else?

Traditional RAID1 will mirror whole drives, yes. BTRFS RAID1 will mirror chunks of data (iirc 1GB) on two drives. So you can have e.g. two 1TB drives and a 2TB one just fine.
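
A minimal sketch of setting that up (device names are placeholders):

    # chunk-level RAID1 for data and metadata across uneven devices
    mkfs.btrfs -d raid1 -m raid1 /dev/sdb /dev/sdc /dev/sdd

    # mounting any one member brings up the whole array
    mount /dev/sdb /mnt/pool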

You can do this with ZFS, you just have to do it manually, i.e. by partitioning up the disks into, say, 100GB or 1TB partitions, then constructing vdevs from these partitions.

You can then extend the pool by adding more such partition-based vdevs as you replace disks, just add the new partitions and add new vdevs.

So if you have a 1TB disk, a 2TB disk and a 4 TB disk, you could have mirrors (d1p1,d3p1), (d2p1,d3p2), (d2p2,d3p3) with a total of 3TB mirrored, and 1 TB available. If you swap the 1TB for a 2TB disk, partition it, replacing the old d1p1 partition with the new and resilver, and once that's done you can add the mirror (d1p2,d3p4) and get the full 4TB redundant storage.

Not a great solution though, as it requires a lot of manual work, and especially write performance will suffer because ZFS will treat the vdevs as being separate and issue IOs in parallel to them, overloading the underlying devices.
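
In zpool terms the layout described above would look roughly like this, reusing the same partition names (purely illustrative):

    # three mirrors built from 1TB partitions on the 1/2/4TB disks
    zpool create tank \
        mirror d1p1 d3p1 \
        mirror d2p1 d3p2 \
        mirror d2p2 d3p3

    # after swapping the 1TB disk for a 2TB one
    zpool replace tank d1p1 new-d1p1
    zpool add tank mirror new-d1p2 d3p4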


Thanks, that makes sense. I think the bcachefs --replicas option does something similar
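
For reference, the whole-filesystem version of that knob looks roughly like this (device names are placeholders; syntax per the bcachefs docs, so double-check against your version):

    # keep two replicas of data and metadata across the member devices
    bcachefs format --replicas=2 /dev/sdb /dev/sdc /dev/sdd

    # multi-device filesystems are mounted by listing the members with ':'
    mount -t bcachefs /dev/sdb:/dev/sdc:/dev/sdd /mnt/pool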

Yeah, BTRFS is really not good for any sort of redundancy, not even very good for multi-disk in general.

1. The scheduler doesn't really exist. IIRC it is PID % num disks.

2. The default balancing policy is super basic. (IIRC always write to the disk with the most free space).

3. Erasure coding is still experimental.

4. Replication can only be configured at the FS level. bcachefs can configure this per-file or per-directory.

bcachefs is still early but it shows that it is serious about multi-disk. You can lump any collection of disks together and it mostly does the right thing. It tracks performance of different drives to make requests optimally and balances writes to gradually even out the drives (not lasering a newly added disk).

IMHO there is really no comparison. If it wasn't for the fact that bcachefs ate my data I would be using it.


That, plus offline deduplication.

Bcachefs has this too

Development taking long usually means that the model itself is too complicated to be done right in a reasonable time, which indicates that the "stable" implementation could still be buggy, but only if you stray away from the common path. It's hard to feel comfortable using such software in a role as fundamental as a file system.

> which indicates that the "stable" implementation could still be buggy, but only if you stray away from the common path

Or that the complexity is such that if a new bug is found, it may take a long time to be fixed because of the complexity, or it is fixed fast and has unexpected knock-on effects even for circumstances on the common path.

Something that takes a long time to be declared stable/reliable because of its complexity, needs to spend a long time after that declaration without significant issues before I'll actually trust it. Things like btrfs definitely live in this category.

bcachefs even won't be something I use for important storage until it has been battle-tested a bit more for a bit longer, though at this point it is much more likely to take over from my current simple ext4-on-RAID arrangement (and when/if it does, my backups might stay on ext4-on-RAID even longer).


I think it's not quite so simple. The problem of organising storage is at least complex, on a scale of "simple / complicated / complex / chaotic". The inherent complexity might be impossible to reduce to something simple or even just complicated, except _maybe_ with layering (à la LVM2), each layer tackling one issue independently of the others. But then it's probably at the cost of performance and other efficiency. Each layer should work such that it does not interfere too much with the performance of other layers. Not easy.

Given the rather cheap price of durable storage these days, I would favour rock solid, high quality code for storing my data, at the expense of some optimisations. Then again, I still like RAID, instantaneous snapshots, COW, encryption, xattr, resizable partitions, CRC... Is it possible to have all this with acceptable performance and simple code bricks combined and layered on top of each other?


In this case I think it's that bcachefs has only a very small set of developers working on it.

But that was not the case for btrfs.

> Development taking long usually means that the model itself is too complicated to be done right in a reasonable time

Yeah, a feature-rich/complete fs is complicated; that's why we have very few of them.


One interesting titbit I've only recently found out is that btrfs can't really balance reads across different drives in RAID1; it picks a drive based on the process id.

ZFS does something smarter here, it keeps track of the queue length for each drive in a mirror, and picks the one with the lowest number of pending requests.


It's not simply that it took a long time to become stable; it's that during this time where it was unstable a lot of people got exposed to btrfs by having it lose data.

Personally, I was one of those people. Very excited about the prospects of btrfs, switched several machines over to it to test, ended up with filesystem corruption and had to revert to ext. Now, whenever I peek at btrfs, I never see anything that's compelling over running ZFS, which I've run for close to 15+ years, and run hard, and have never had data loss. Even in the early days with zfs+fuse, when I could regularly crash it, the zfs+fuse developers quickly addressed every crash I ran into once I put together a stress test.


> it is now stable, and has been for a long time.

Is it really? I must have missed the news. Back when it was released completely raw as a default for many distros, there were fundamental design level issues (e.g. "unbound internal fragmentation" reported by Shishkin). Plus all the reports and personal experiences of getting and trying to recover exotically shaped bricks when volume fills to 100% (which could happen at any time with btrfs). Is it all good now? Where can I read about btrfs behaving robustly when no free space is left?


Btrfs lost its credibility and many people would never trust it.

So a year ago i tried to repeat my old trick for damaging btrfs (as a user, NOT root): fill the volume with dd if=/dev/urandom of=./file bs=2M && sync && rm ./file, then reboot the machine. And yes, the trick still works: it's not booting anymore, bravo.

BTW: Even SLES (SUSE Linux Enterprise Server) says to use XFS for data and btrfs just for the OS. i wonder why


> BTW: Even SLES (SUSE Linux Enterprise Server) says to use XFS for data and btrfs just for the OS. i wonder why

Because XFS is far quicker for server-related software such as databases and virtual machines, which are weak points on btrfs due to its COW model.


Doesn't "chattr +C" give you back that performance, while still letting you keep the rest of the benefits of Btrfs?

Nodatacow is an ugly hack because it disables btrfs's core features for the affected data. It also should not be used with raid1.

Yeah and maybe additionally you want to keep your data and have a stable filesystem for them ;)

> So a year ago i tried to repeat my old trick for damaging btrfs (as a user, NOT root): fill the volume with dd if=/dev/urandom of=./file bs=2M && sync && rm ./file, then reboot the machine. And yes, the trick still works: it's not booting anymore, bravo.

Do you know how ZFS handles that?


Without any problems. No other Filesystem i tested has that problem (ext4, XFS, ZFS, NTFS, JFS, Nilfs2)

Good to hear:) My understanding is that it's easier to break a CoW filesystem like that because if you run out of space you can't even delete things (because that requires writing that change), so I'm not surprised that the rest (the non-CoW filesystems) did fine, but I'm happy to hear that ZFS also handles it.

As little as one year ago I experienced damage on a lightly used btrfs root partition on my laptop. Never again. I use ext4 root and ZFS for /home for snapshots and transparent compression now, all on top of LVM

btrfs still has many weird issues. E.g. you can't remove a drive if it has I/O errors, even if the rest of the array still has enough space to accommodate the data.

You can do a replace, but then you need to buy a new drive.


btrfs is not stable, at least not for me. it lost my data only a couple months ago. no power cut, no disk failure, data just gone.

My personal grievances with btrfs are multifaceted.

- I never agreed with the btrfs default of root raid 1 system not booting up if a device is missing. I think the point of raid1 is to minimize downtime when losing a device and if you lose the other device before returning it to good state, that's 100% on you.

- Poor management tools compared to md (though bcachefs might be in the same boat). Some tools are poorly thought out, e.g. there is a tool for defragmentation, but it undoes sharing (so snapshots and deduped files get expanded).

- If a drive in raid1 drops but then later comes back, btrfs is still quite happy.

- Need of using btrfs balance, and in a certain way as well: https://github.com/kdave/btrfsmaintenance/blob/master/btrfs-... .

- At least it used to be difficult to recover when your filesystem becomes full. Helps if you have it on LVM volume with extra space.

- Snapshotting or having a clone of a btrfs volume is dangerous (due to the uuid-based volume participant scanning)

- I believe raid5/6 is still experimental?

- I've lost a filesystem to btrfs raid10 (but my backups are good).

- I have also rendered my bcachefs in a state where I could no longer write to the filesystem, but I was still able to read it. So I'm inclined to keep using bcachefs for the time being.

Overall I just have the impression that btrfs got complicated and ended up in a design dead-end, making improvements hard, and I hope that bcachefs has made different base design choices, making future improvements easier.

Yes, the number of developers for bcachefs is smaller, but frankly as long as it's possible for a project to advance with a single developer, it is going to be the most effective way to go—at the same time I hope this situation improves in the future.


> I never agreed with the btrfs default of root raid 1 system not booting up if a device is missing.

Add "degraded" to default mount options. Solved.


Bad defaults is a huge issue even when you can change the config to something sane.

Defaulting to degraded is a bad default. Mounting a btrfs device array degraded is extremely risky, and the device not booting means you'll actually notice and take action.

Md devices do "degraded" by default and it seems fine. Indeed, I believe this is the default operation of all other multi-device systems, but of course I cannot verify this claim. I dislike all features that by default prevent my system from booting up.

The annoying part of this is that if you do reboot the system, it will never end up responding to a ping, meaning you need to visit the host yourself. In practice it might even have other drives you could configure remotely to replace the broken device. I use md's hot spare support routinely: an extra drive is available, should any of the drives of the independent raids fail.

Granted, md also has decent monitoring options with mdadm or just cat /proc/mdstat.


Defaults should favor safety over convenience. Fail fast and fail hard. Md-raid's defaults are simply wrong.

Based on this logic, should the default mode of operation be to stop I/O when a disk dies?

As someone else pointed out: how is that different from losing a disk while running? Do you want the file system to stop working or become read-only if one disk is lost while running too? I think the behaviour should be the same on boot as while running.

You want a flaky system to pick a random disk at every boot and pretend that's the good one?

The usual reason I've seen RAID 1 used for the OS drive is -so- it still boots if it loses one.

Not doing so is especially upsetting when you discover you forgot to flip the setting only when a drive fails with the machine in question several hours' drive away (standalone remote servers like that tend not to have console access).

I think 'refusing to boot' is probably the right default for a workstation, but on the whole I think I'd prefer that to be a default set by the workstation distro installer rather than the filesystem.


That sounds like the right default then. If you're doing a home install, you get that extra little bit of protection. If you're doing a professional remote server deployment, you should be a responsible adult understanding the choices - and run with scrubbing, smart and monitoring for failures.

"Will my RAID configuration designed so my system can still boot even if it loses a drive not actually let it still boot if it loses a drive?" is not a question that I think is fair to expect sysadmins to realise they need to ask.

Complete Principle of Least Surprise violation, given that every other RAID1 setup I'm aware of will still boot fine.

Also said monitoring should then notify you of an unexpected reboot and/or a dropped out disk, which you can then remediate in a planned fashion.

If this was a new concept then defaulting all the safety knobs to max would seem pretty reasonable to me, but it's an established concept with established uses and expectations - a server distro's installer should not be defaulting to 'cause an unnecessary outage and require unplanned physical maintenance to remediate it.'


> and the device not booting means you'll actually notice and take action.

How often do people reboot their systems? IMO, if it's running (without going into "read-only filesystem" mode) with X disks, it should boot with the same X disks. Otherwise, it might be running fine for a long time, and it arbitrarily not coming back in case of a power failure (when otherwise it would be running fine) is an unnecessary trap.


Since RAID is not a backup, isn't availability the main point of RAID1?

Good luck doing that after the disk shuts down okay but never comes back online

This option is only needed if you can mount the filesystem, but only degraded/ro. If you're in that situation, you can remount easily. If not, that's not the solution you need.

Add rootflags=degraded to the grub kernel options.

Bcachefs was merged into the kernel only months ago, and had an immediate flurry of bug fixes due to the additional testing this brought. (It was in development for some years before that out of tree). That's the level of maturity that it is at. I think there's a hope that it will become more trustworthy than btrfs due to the developer's success with bcache.

I've been running bcachefs on my main desktop and laptop (My really important data is on my fileserver or in my private git repo (including dotfiles), I'm not crazy) since cachyos made it an install option and it's honestly worked better and caused less problems than btrfs has for me in the past so far. Maybe I was just unlucky but btrfs caused read only filesystem issues and a catastrophic loss on a couple of my computers a few years ago. I am pretty impressed with bcachefs so far.

> since cachyos made it an install option and it's honestly worked better and caused less problems

0 problems in 2.5 months is not necessarily better than 1-2 problems in ~3 years, though. If we're just talking about the single partition boot drive use case, I think I'd go with the option that's had vastly more time to find and eliminate bugs. (If you're conservative about this stuff that probably means ext4, actually.)


I would agree with you if I didn't run my stuff the way I do (and I'd use zfs or maybe ext4). Pretty much all my important stuff is on a raidz6 file server with a secondary backup raidz6 file server locally pulling backups each night and backups being sent offsite streaming throughout the day. My dotfiles are synced to my local private git repo via yadm (although if I didn't have this system running already I would probably take the time to figure out nix home manager instead of yadm right now). I have a bunch of bash scripts I wrote to automate the most annoying parts of reinstalling as well, so what I am really risking by running bcachefs is about a half hour to reinstall cachyos on whatever system eats it and possibly some minor configuration changes I may not have synced to git via yadm yet.

Nice. Its advancement depends on folks like you, so thanks!

Hmmh, under "Why bcachefs?" we find

- Stability but also

- Constant refactorings

and later

"Disclaimer, my personal data is stored on ZFS"

A bit troubling, I find

"RAID0 behavior is default when using multiple disks" never have I ever had the need for RAID0 or have I seen a customer using it. I think it was at one time popular with gamers before SSDs became popular and cheap.

"RAID 5/6 (experimental)

    This is referred to as erasure coding and is listed as “DO NOT USE YET”, "
Well, you've got to start somewhere, but a comparison with btrfs and ZFS seems premature.

> "Disclaimer, my personal data is stored on ZFS"

> A bit troubling, I find

I appreciated the candor

The approach of the bcachefs developers is that they will only recommend its usage if it's absolutely, 100% stable and won't eat your data. Bcachefs isn't in that state yet and the developers don't pretend it is.

This avoids the kind of trust issues that btrfs has


Does it? From the btrfs docs:

> The RAID56 feature provides striping and parity over several devices, same as the traditional RAID5/6. There are some implementation and design deficiencies that make it unreliable for some corner cases and the feature should not be used in production, only for evaluation or testing. The power failure safety for metadata with RAID56 is not 100%.


I wonder why ZFS is marked as not having de-dupe (deduplication).

AFAIK ZFS has had deduplication support for a very long time (2009) and now even does opportunistic block cloning with much less overhead.
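
For anyone checking, it's a per-dataset property (dataset name is an example):

    # enable online dedup on one dataset only; the dedup table eats RAM
    zfs set dedup=on tank/backups

    # the DEDUP column in the default output shows the achieved ratio
    zpool list tank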


ZFS online deduplication is not comparable with the on-demand dedup offered by btrfs and xfs and is prohibitively expensive for many workloads.

The new block cloning still had data corruption bugs quite recently.


>ZFS online deduplication is not comparable with the on-demand dedup offered by btrfs and xfs

But it has deduplication; by your logic no non-CoW FS should be in that list because they are not comparable.


It is still deduplication.

The chart should have separate block-dedup and file-dedup columns if they are deemed not comparable.


Also, XFS has had deduplication for some time now, at least a year or two.

btrfs has deduplication as well.

In theory full file deduplication exists in every filesystem that has cow/reflink support
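
The per-file flavor is just a reflink copy, which coreutils exposes directly (works on btrfs, XFS with reflink, bcachefs, and OpenZFS 2.2+ where block cloning is enabled):

    # share extents instead of duplicating the data
    cp --reflink=always big.img big-copy.img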


fclones for example covers it well for any filesystem with reflinks:

https://lib.rs/crates/fclones

fclones group . | fclones dedupe


Yes I sort of skip-read a lot of it after that.

> btrfs Encryption Y

btrfs doesn't have a built-in encryption.

> ZFS Encryption Y

I cannot find the discussion right now but I remember reading that they were considering a warning when enabling encryption because it was not really stable and people were running into crashes.

https://github.com/openzfs/zfs/issues?q=is%3Aissue+label%3A%...


That bug is old, is missing information, and hasn't been closed even though the reporter says the problem was solved after an update.

I see it more as an administrative problem than an issue with ZFS encryption.



Anecdote: I've been using ZFS encryption for a looong time and never had any problems.

I've been running bcache, with lvm/luks and xfs on top, for >5 years on my desktop. It has been stable, and partition manipulations, like resizes, worked without problems, albeit the tooling is not so well supported.

I bought a new ssd and hdd for my desktop this year and looked into running bcachefs because it offers caching as well as native encryption and cow. I determined that it is not production ready yet for my use case; my file system is the last thing I want to be a beta tester of. I investigated using bcache again, but opted to use lvm caching, as it offers better tooling and saves one layer of block devices (with luks and btrfs on top). Performance is great and partition manipulations also worked flawlessly.

Hopefully bcachefs gains more traction and will be ready for production use, as it combines several useful features. My current setup still feels like making compromises.


> ZFS, pioneering COW filesystem, ... commendable, its block-based design diverges from modern extent-based systems due to complexities in implementing extents with snapshots.

why is this a bad thing?


How well does bcachefs handle databases and VMs? Those workloads are well-known to be btrfs' kryptonite whereas ZFS seems to tolerate them pretty well as long as one sets the correct recordsize (example: https://www.enterprisedb.com/blog/postgres-vs-file-systems-p...).
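
On the ZFS side, that recordsize tuning typically means matching it to the database page size (dataset names are examples):

    # PostgreSQL uses 8K pages, InnoDB uses 16K pages by default
    zfs create -o recordsize=8K  tank/pgdata
    zfs create -o recordsize=16K tank/mysql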


It's worth mentioning that bcachefs has gotten a fair bit of performance work since the last round of Phoronix benchmarks. Also there's some bug where the formatting tool selected 512 byte blocks by default instead of 4k byte blocks on drives where other filesystems picked 4k bytes, which impacted the Phoronix benchmarks. Unsure if that has been fixed yet.

IIRC bcachefs is going to add a non-COW mode which should be good for databases.

NoCOW on btrfs is a kludge because it disables the core features of btrfs and is dangerous to use with raid1. Since ZFS doesn't even have a nocow mode, surely there are other ways of dealing with databases?
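
For completeness, the btrfs kludge is usually applied per directory, before any data lands in it (path is just an example):

    # +C (nodatacow) only takes effect on new/empty files, so set it on the dir;
    # files created inside inherit it, and lose checksumming and compression
    mkdir -p /var/lib/libvirt/images
    chattr +C /var/lib/libvirt/images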

I had really high hopes for HAMMER2, including that it would one day be ported to Linux, but it seems to have remained firmly planted in Dragonfly BSD and it's not really clear what the status is. https://en.wikipedia.org/wiki/HAMMER2

An interesting analysis. I can't stop my brain from parsing the title as "B C A Chefs".

I recently tried btrfs on a new USB thumb drive. I immediately got hard freezes of my main (Linux) OS while working with the USB stick.

Never again.

I eagerly await bcachefs reaching maturity!


I hope you reported the issue. That smells like a bug beyond the scope of btrfs itself. The basic filesystem has been stable for a very long time.

Maybe that was a bad USB port? (I have one such port that intermittently disconnects)

I have a USB stick with btrfs + LUKS on Arch Linux and it never had a problem like this


Same port and USB stick worked fine with XFS and ext4.

Tried again with btrfs and hard freezes again.


>Same port and USB stick worked fine with XFS and ext4.

None of those file systems are comparable to BTRFS, since they're not COW. BTRFS isn't for crappy USB drives, since it has a lot more overhead than EXT4 and XFS, which the controllers and flash chips in junky USB drives can't handle.


It's not a crappy USB drive. It's a high-end high-performance Sandisk USB drive.

Regardless, I still expect the choice of filesystem to not hard freeze my OS.

I've had janky crappy USB drives before and with any other filesystem reads/writes might fail but I don't get a hard freeze.

One could argue it could be a bug in the Linux USB stack rather than a bug in the btrfs kernel driver. But what I do know is that when I pick btrfs I get problems.


>It's a high-end high-performance Sandisk USB drive

That doesn't mean much. How high end? USB thumb drives are still much slower than SSDs or even HDDs.

>Regardless, I still expect the choice of filesystem to not hard freeze my OS.

That's a very particular filesystem that's not designed for such hardware. Of course you'll run into issues. The FS expects data at a throughput that the ARM controller in the USB thumb drive can't deliver. Of course it will have issues.


Yeah, no.

You're overestimating how much overhead btrfs has. 500MB/s read/write is way more than it needs. I wasn't even doing any sustained writes.

In any case, I've been burned by btrfs issues on SSDs several times since 2018. Metadata issues, unrecoverable problems, corruption.

If you're ok with a filesystem causing hard freezes, then all power to you.

Not me. In my world, a filesystem that has this many issues is not a filesystem I want to use.

Edit: Can't reply to your reply. So here will do. Irrelevant how fast the USB is. If the filesystem has to do things slower, then simply do things slower. There's no excuse for hard freezes.

> You're trying to drive a Ferrari on a dirt road and claiming the dirt road is at fault when your Ferrari has issues.

So btrfs in this metaphor is the Ferrari? Yes, my Ferrari definitely 100% has issues XD

Anyway, this particular USB drive I was able to push to the limit with XFS. Sustained reads and writes. Way more than the manufacturer would have tested for. And no freezes. I rest my case.


>500MB/s read/write is way more than it needs. I wasn't even doing any sustained writes.

Benchmarked or claimed? For 500MB files or 4kb?

>If you're ok with a filesystem causing hard freezes, then all power to you.

To me it never caused that, and I used it on some slow-ass 5200 rpm HDDs. You have to understand that the firmware on the flash controller on that USB stick wasn't built or tested for the workloads BTRFS triggers. Even some nvme SSDs are reported to act weird under it because the manufacturer tests the firmware for FAT/NTFS loads, not BTRFS or other such FSs. So the issue is with your thumb drive firmware, which is independent of the claimed 500MB/s raw sequential throughput.

You're trying to drive a Ferrari on a dirt road and claiming the dirt road is at fault when your Ferrari has issues.


Ah, HN is letting me reply now.

Irrelevant how fast the USB is. If the filesystem has to do things slower, then simply do things slower. There's no excuse for hard freezes.

> You're trying to drive a Ferrari on a dirt road and claiming the dirt road is at fault when your Ferrari has issues.

So btrfs in this metaphor is the Ferrari? Yes, I agree with you that my Ferrari definitely 100% has issues XD

You keep talking about btrfs workloads as though it's something crazy for usb firmware to handle. It really isn't. There's a bit of performance overhead, but not so much as to be unusable. And it's just I/O at the end of the day, not some mysterious workload that pushes the wrong buttons. USB firmware is more likely to freak out when doing sustained I/O and the thing gets too hot.

Anyway, I did minimal writes with btrfs before the freezes, and this particular USB drive I was able to push to the limit with XFS. Sustained reads and writes in a benchmark workload. Way more than the manufacturer would have tested for. And no freezes. I rest my case.


Had to do a double-take on the UI of this blog. It looks identical to my notetaking app, Trilium.

Does it allow both shrinking and growing the FS? Really wish ZFS allowed shrinking.

The idea that a brand new filesystem might be more reliable than good ol' BTRFS, which Facebook runs on basically their entire infrastructure, is downright laughable to me.

Btrfs is also far more reliable than ZFS in my view, because it has far far more real world testing, and is also much more actively developed.

Magical perfect elegant code isn't what makes a good filesystem: real world testing, iteration, and bugfixing is. BTRFS has more of that right now than anything else ever has.


I've had my own bad experiences with Btrfs (it doesn't behave well when close to full), and my intuition is that Facebook's use of it is in a limited operational domain. It works well for their use case (uploaded media I think?), combined with the way they manage and provision clusters. Letting random users loose on it uncovers a variety of failure modes and fixes are slow to come.

On the other hand, while I haven't used it for /, dipping my toes in bcachefs with recoverable data has been a pleasant experience. Compression, encryption, checksumming, deduplication, easy filesystem resizing, SSD acceleration, ease of adding devices… it's good to have it all in one place.


> my intuition is that Facebook's use of it is in a limited operational domain

That's not really true: it's deployed across a wide variety of workloads. Not databases, obviously, but reliability concerns have nothing to do with that.

My point isn't "they use it, it must be good": that's silly. My point is that they employ multiple full time engineers dedicated to finding and fixing the bugs in upstream Linux, and because of that, BTRFS is more well tested in practice than anything else out there today.

It doesn't matter how well thought out or "elegant" bcachefs or ZFS are: they don't have a team of full time engineers with access to thousands upon thousands of machines running the filesystem actively fixing bugs. That's what actually matters.

> Compression, encryption, checksumming, deduplication, easy filesystem resizing, SSD acceleration, ease of adding devices... it's good to have it all in one place.

BTRFS does all of that today.


Why is that laughable? I do not think that it is more reliable than btrfs but it is not a crazy idea either. There are a whole bunch of people in these comments with very recent btrfs reliability issues which have affected them and nobody with recent zfs reliability issues.

If anecdotes are meaningful (dubious, but I'll play along...), I can counter with mine: I've been running btrfs on bleeding edge kernels for a decade, and I've never seen a single data loss event.

ZFS has corruption bugs, this one was far worse than anything I've seen in btrfs recently: https://lists.freebsd.org/archives/freebsd-stable/2023-Novem...


I've been running it on hundreds of servers in prod since one of the lead devs gave a talk at LinuxCon 2014 saying it was good to go. Had a few performance issues here and there, especially on older kernels, but never any data loss.

I also think you can't really compare them: ZFS more or less says "never use without ECC memory". BTRFS is run on just about any potato there is.

I myself would never run a file server without ECC and a UPS configured for a graceful shutdown. I have also never had any issues, but I only have about 10tb of data.


Mandatory xkcd comic: https://xkcd.com/927 (replace "standards" with "FS")

There are only two competitors to bcachefs: btrfs and zfs. So having a third player in this space is a good thing, especially since a lot of people (in my opinion for good reason) do not trust btrfs meaning there is only really zfs.

ZFS isn't a real competitor given it's not in the kernel and has legal troubles.

Who is downvoting this? Among the large Linux distributions, ZFS is only really supported by Ubuntu, and even that is on the level "Canonical lawyers reviewed this and believe they're safe". If the unmentionable company ever goes to court against them, you're in hot water. You'll have to migrate to FreeBSD or support yourself by building dkms modules. So you're taking a non-zero risk by adopting ZFS.

If you're really conservative with these things, as some of us are, you currently don't really have a single safe COW pick. (Smug FreeBSD users incoming.) I have most trust in bcachefs over the long term.


> You'll have to migrate to FreeBSD or support yourself by building dkms modules. So you're taking a non-zero risk by adopting ZFS.

DKMS works fine. And you can also take the NixOS approach where it's not built into the kernel but the system will never actually install a new kernel unless ZFS is supported on it and successfully builds for it.

I agree that it's still not ideal, though. I hope bcachefs thrives and takes a place of pride in the Linux world!


It's very well supported by NixOS, been using it for a decade or so.

The key difference is that we don't need to agree on a certain FS, whereas the reason for standards is interoperability.

I have both bcachefs and ext4 filesystems on the same machine, for different uses.


The situation is different though: we have very few modern filesystems and have desperately needed some diversity and competition in this area. I've been waiting decades for something like bcachefs; I thought it would be btrfs, but that turned into a disappointment - for my needs.

It seems like "modern filesystem" development focus shifted to distributed systems a few years back.


