14TB Hard Drives Now Available

loeg · on Nov 4, 2017

SMR drives can be thought of as tape libraries. You have a bunch of large, independent tapes that can all be written sequentially. But you can't do random writes to the tape. (SMR drives also have a lower density, non-SMR "scratch region" that can be used for metadata, or to build up to a larger I/O piecemeal. My metaphor is breaking down here.)

They're not generally useful for the kinds of applications harddrives are classically used for (classic filesystems). They're mostly useful for applications that already use tape, with the benefit of much quicker random reads.

Log filesystems may be able to take advantage of them. There will necessarily be I/O overhead to garbage collect deleted files out of SMR regions. (Because you have to rewrite the entire SMR region to compact.)

Is anyone using or planning to use SMR drives in production? If you're able to share, I'd be curious to learn about your use case and how you plan to make efficient use of the disk.

phil21 · on Nov 4, 2017

I use a bunch of 8TB SMR Seagate Archive drives in a ZFS pool for backups and media storage.

I run ZoL, and honestly they've surprisingly been relatively solid drives. You can expect roughly 10-15MB/sec or so write throughput per drive, and latency is of course pretty bad. Across 24 spindles though, I haven't had too many complaints - it's archival storage and reads are fast enough for most use cases - about 50-70MB/spindle sustained for large files.

I would not try to say rsync millions of tiny files against this pool - it would not hold up well. However for it's use case - write large files once/read occasionally I'm quite happy - I just wish price would have come down over the course of 2 years as I had originally expected. You can expect to pay $200-220/drive even today, and that's what I paid the first batch when starting my pool.

Out of 24 total spindles I had 2 early failures (within 120 days of install) but otherwise no other failures. These drives were replaced hassle free via RMA. My I/O pattern is pretty light - probably 100GB written per day across 24 spindles, and maybe 1TB read.

Basically if you try to do anything but streaming writes you're going to have a bad time. They are a bit more forgiving on the read side of the fence however. Don't expect these things to break any sort of speed records!

If I were buying today I wouldn't buy 8TB SMR - I'd pay the $20-30/spindle premium for standard drives. I'd have to look at the 14TB costs to see if the huge speed tradeoff would be worth it. When I first started using them, the cost per GB was compelling enough to give it a shot and I'm pretty happy with the results.

mozumder · on Nov 4, 2017

What do you need to do to get FreeBSD to use SMR drives properly with ZFS? Are there special drivers?

It seems ideal for a write-once/read-many media server.

phil21 · on Nov 4, 2017

These are drive-managed SMR, so no special drivers needed. I agree they are ideal for write-once/read-many media servers :)

Today I'm not sure the 8TB drives make any sense as prices have come down so far on regular 5400rpm slower drives that are much faster. These new 14TB spindles will be interesting to keep an eye on.

I imagine SMR won't really take off, if it does I'd expect more direct kernel/driver support for the hardware. Drive-managed SMR is always going to be exceedingly inefficient.

fulafel · on Nov 4, 2017

SMR drives have been shipping since 2014 (eg Seagate Archive HDD), and people are using classic filesystems with them to replace media storage, backups, and other largish file apps. It's much slower than traditional spinning HDDs at small write IO's, once you blow the write cache layer, but not tape-level terrible. You wouldn't use one for your root FS. But then again you wouldn't use a spinning HDD there anyway.

There are ideas about "host-managed SMR" too where the HDD device presents the low level SMR device, and would need customized filesystems and drivers. But the idea isn't very attractive in the marketplace.

blattimwind · on Nov 4, 2017

For sequential writes they (i.e. the Samsung Archive v2) are relatively decent (about 80-150 MB/s depending on the mood of the drive).

jldugger · on Nov 4, 2017

Supposedly Facebook is trialling them with cold storage -- think backups they must keep legally but don't intend on using much.

I could see a usecase for certain NAS / media storage purposes, but you'd have to not only use log structured filesystems, but also redesign the network protocols to support efficient writes, and possibly specialize block allocation to match the hardware constraints. You certainly wouldn't want say bittorrent writing directly to them, and even streaming-like services like MythTV may be a problem.

cm2187 · on Nov 4, 2017

They are probably a good choice for timemachine like backups, where every write is incremental.

poizan42 · on Nov 4, 2017

So as I understand the primary issue with SMR drives is that non-sequential writes tends to be slow because several tracks needs to be rewritten. Wouldn't the perfect solution be to make it a hybrid disk instead (i.e. add a small SSD to it)? That way the data can be written to SSD quick and then the firmware can move it onto the platters slowly.

Of course there are some issues about balancing SSD size vs. cost and write cycles, but at least if designed properly the failure mode once the SSD part is weared down would be that it just functions as a regular (device managed) SMR drive.

jamilbk · on Nov 4, 2017

This can essentially be accomplished with ZFS and an SSD as a L2ARC device.

I have three 8 TB Seagate SMR drives in RAID-Z (aka RAID-5) with a 500 GB Samsung 950 Pro M.2 as a L2ARC cache drive on the pool and it works beautifully for my workload.

In my experience, once the scratch area of the SMR drive is full, I get about 4-5 MB/s sustained write speed, which in my pool translates to 10 MB/s for the pool. Since the scratch area is about 25 GB, that means I can do 500 GB + 40-50GB of random writes before things slow to a crawl, and I have to wait (550 GB / 10 MB/s) = 55,000 s ~ 16 hours for the writes to flush.

I paid $179.99 each for these. Not Too Shabby.

jedberg · on Nov 4, 2017

Sounds like they would be perfect for Cassandra, or anything else that uses SSTables.

manigandham · on Nov 4, 2017

Why? SSTables/LSM storage for file data is completely separate (and much higher level) from physical storage tech like SMR. These drives are very slow compared to even standard hard drives and probably cant even keep up with writes, let alone compaction.

white-flame · on Nov 4, 2017

I bought an 8TB shingled drive. While the initial hundreds of MB of backup files wrote at reasonable speeds, it quickly dropped down to 4-5MB/sec sustained average for the rest, until it was idle for hours to catch up. Host management probably can help, but I'm not touching another SMR drive again.

loeg · on Nov 4, 2017

Yeah, that's the problem with drive-managed SMR — these drives are totally unusable by SMR-naive applications, so there's no point hiding it from the host. I think SMR drives have very niche application.

ioquatix · on Nov 4, 2017

I guess if you use these for a log-style database, write only backup (e.g. a replacement for tapes with better random read), you'll be happy. But for a general purpose drive they suck.

white-flame · on Nov 4, 2017

What I was doing was large backups, no overwriting, and it slowed down massively. I reformatted to a linear log filesystem, and still had the same unusable performance trend. However, since the drive was already somewhat used, and the drive itself knows nothing about the filesystem but only about sectors, I'm sure it was shuffling all the old data around as well.

Strangely enough, I could imagine it might work for a primary hard drive, as writes tend to be small and bursty allowing the SMR shuffling to catch up. But installation would take days.

Marketing them as "Archive" drives as Seagate did is the absolute wrong case for these. It's impossible to get any backup/archive copied over in any timely fashion. As a live mirror which gets piecemeal changes as they happen, then maybe. But that's still not an "archive".

mynewtb · on Nov 4, 2017

> since the drive was already somewhat used

You are spreading a lot of FUD based on one single anecdote with a used drive. Let me counter that anecdote with my own complete satisfaction of using such drive for over a year of daily backups without anything 'taking days'.

white-flame · on Nov 6, 2017

I bought it new. It was used by me in a prior filesystem (ext4) to attempt backing up files, so the drive had actively written user sectors on it before trying out a linear log filesystem for the same backups. If instead I had used such a FS on the drive in its new state, there might have been a possibility of better performance, depending on the internal drive management software. The notion of returning it to a state where it considered sectors unused is also dependent on that software.

Dylan16807 · on Nov 4, 2017

Why does it matter that the drive was used? If I can permanently damage my performance by using a different filesystem for a while, that's a big deal. It's not FUD to talk about how drive-managed SMR is unpredictable, and might work great or might work horribly, and you can't entirely predict what you'll get.

mynewtb · on Nov 4, 2017

Why should filesystem specifics matter to the device itself? If you believe that is the case, did you rewrite the whole drive with zeros or something, or better yet use some device builtin erase function?

Dylan16807 · on Nov 4, 2017

> Why should filesystem specifics matter to the device itself?

I agree! That's why I think "drive was used" should not be a disqualifier for seeing horrible performance. Why should a former filesystem matter?

> did you rewrite

Not my drive.

mynewtb · on Nov 4, 2017

A used drive might have been treated badly, have thousands of online hours etc

lucaspiller · on Nov 4, 2017

> I reformatted to a linear log filesystem ... I'm sure it was shuffling all the old data around as well.

I’d assume for these you should follow the same advice as SSDs, and issue the ATA Secure Erase command to have the drive wipe itself (or as is typically the case now, just it’s internal state and encryption keys):

https://ata.wiki.kernel.org/index.php/ATA_Secure_Erase

Sami_Lehtinen · on Nov 4, 2017

SMR drives would benefit from TRIM / Discard support. That's what I've been speculating. Afaik SMR drives aren't using TRIM?

baobrien · on Nov 4, 2017

Proper host-aware and host-managed SMR drives have their own command set to manage zones. It's kind of like TRIM, but specific to SMR drives.

moonbug22 · on Nov 4, 2017

They're not for single drive applications. They belong in object storage arrays.

white-flame · on Nov 9, 2017

Seagate sold (sells?) them as consumer packaged single external USB-3 drives, so at least their marketing posture doesn't promote them only for that particular use. Which is part of the problem.

pavs · on Nov 4, 2017

True. My rule of thumb, anything over 4TB, 7200rpm or not - I use only for long backup. Not on production app that needs good IO speed.

loeg · on Nov 4, 2017

This is kind of a non-sequitur. SMR drives have vastly worse performance than non-SMR drives. Neither RPM nor capacity has anything to do with SMR or non-SMR.

white-flame · on Nov 4, 2017

Higher capacity drives often have higher platter data rates, because of the increased density (or more platters & heads) at the same rotational speed. They should have better I/O throughput than 4TB drives... as long as they're not SMR. :-P A SMR drive that takes multiple days to transfer 500GB to it isn't useful as a maintained backup either.

phil21 · on Nov 4, 2017

SMR drives really aren't intended at all for single drive applications.

Get 512 of them in a giant pool (especially if it's SMR aware) and the throughput issues start to be less of a concern for certain applications. Anything write intensive of course means you immediately look elsewhere.

jldugger · on Nov 4, 2017

At that rate by my estimate, it'd take 324 days to actually write 14TB to disk. Probably not a great candidate for most applications.

aembleton · on Nov 4, 2017

"2.5 million hours MTBF"

How do they calculate that? That works out as 285 years!

Half of these drives will last 285 years of constant operation. I find that hard to believe.

alvis · on Nov 4, 2017

These two wiki explain well what this number mean.

[1] https://en.wikipedia.org/wiki/Reliability_engineering [2] https://en.wikipedia.org/wiki/Survival_analysis

jsmthrowaway · on Nov 4, 2017

It’s predicted statistical failure rate in aggregate for the drive model. Two drives running for 24 hours is 48 hours for MTBF purposes, so scale MTBF by the number of drives. 1,000 drives would expect to have one failure per 104 days. 10,000 drives would expect to have one failure about every week and a half. It’s not “half these drives will run 2.5 million hours.” I agree, the “mean” is confusing until you know that. It also doesn’t predict low quantities well, including 1, for obvious reasons. The larger the quantity gets, the better the prediction tends to get IME. Backblaze has published work on this.

Failure rate is also independent of lifetime. Drives have other measurements for lifetime.

yosito · on Nov 4, 2017

So, in layman's terms, if I buy two of these and mirror them is it a pretty safe bet they'll last my whole life?

jsmthrowaway · on Nov 4, 2017

No, because design lifetime is distinct from failure rate. Failure rate is just that: a predicted rate of failure (not lifetime) in aggregate for a model within design lifetime. Beyond lifetime, all bets are off. Think of this MTBF as saying “your two drives probably won’t fail within lifetime. A significant number of your 10,000 will.”

Regarding longevity, often the predicted lifetime of a drive is close to its warranty. You will sometimes experience no issues exceeding design lifetime, and sometimes drives immediately explode. I’ve seen both, from four-year lifetime drives entering year 13 in continuous service to other drives buying the farm one day after lifetime and SMART wear indicator is fired.

As drives age, mechanical disruption becomes a much bigger deal. That rack of 13-year drives is probably one earthquake or heavy walker away from completely dead in every U. Even power loss, including from regular shutdown, will probably permanently end the drive when they’re far beyond lifetime. That’s the danger in a 24x7 server setting if you’re not monitoring SMART wear indicators (even if you are, really); power cycling your rack can, and does, trigger multiple hardware failures. All the time. If all the drives in it were from the same batch, installed at the same time, and an equal amount past lifetime, it’s very possible for the whole rack to fail when cycled — I have actually heard of this happening, once.

MTBF is unexpected failure. Design lifetime is expected failure.

DamonHD · on Nov 4, 2017

I've had a nice rack of discs, though small by today's standards, AFAIK well within any MTBF, wrecked when some junior decided to see what the phased shutdown button in the machine room did during the middle of the bank's trading day, cutting off half the power to my cabinet at one point, which SCSI doesn't protect against.

There's always events dear boy, events!

Spooky23 · on Nov 4, 2017

Nooo. MBTF is a engineering term of art.

MTBF lets you compare between two drives intended for a similar purpose.

For something that you own at home, the only medium that will last your lifetime in loosely controlled environmental conditions is paper.

wongarsu · on Nov 4, 2017

High-quality paper. A lot of places have problems that the paper records they have to keep for 10 years become unreadable in less than 10 years.

Archival CDs might also last a lifetime (regular CDs certainly won't).

But I agree with your point that the MBTF of a single unit is not a way to predict it's lifetime.

Spooky23 · on Nov 4, 2017

That's a great point.

It dawned on me when some older relatives died a few years ago that their papers, some very old, survive with a modicum of careful handling. My grandfather's immigration papers, great-grandmothers portrait on her wedding day, etc.

Today, many of us are one expired credit card away from losing all of that in digital form.

mozumder · on Nov 4, 2017

What are acceptable low-number aggregate? 10? 100?

jsmthrowaway · on Nov 4, 2017

When the calculated result starts exceeding what is reasonable or makes any sense. It varies, and is a gut feeling. The more you have, the better.

Having 100 of this model would predict a failure about every three years (but that doesn’t mean it’d take 300 years to fail all of them). I’d be suspicious of a three year calculation, but it very well might turn out to be accurate. Remember that lifetime and failure rate are independent metrics. They’re warranted for five years, which is probably close to their predicted lifetime, and one drive of 100 failing in that time is certainly plausible.

Having 10 basically implies none of them will fail within a five-year lifetime. Again, plausible, but I think less likely.

piaste · on Nov 4, 2017

I assume you take the first batch of N drives, run them for K weeks, see how many fail, and extrapolate.

The methodology ought to be published, though.

jsmthrowaway · on Nov 4, 2017

It is. There’s a JEDEC standard for SSDs and probably one for spinners. Different companies use different write loads to find it. You’re pretty bang on, actually, but some vendors test for years.

dmitrygr · on Nov 4, 2017

Does Linux kernel have the code to manage these SMR drives as they need to be? (Seems like HGST purposefully left that to the OS to do - imho a smart choice)

wmf · on Nov 4, 2017

Here's the recent-ish status: http://events.linuxfoundation.org/sites/events/files/slides/...

Looks like F2FS should work as of Linux 4.10.

loeg · on Nov 4, 2017

See slide 21 — F2FS garbage collection after file/directory deletion is very expensive compared to classical filesystems on conventional disks.

koverstreet · on Nov 4, 2017

If someone was looking to jump into working on bcachefs, adding SMR support would be an excellent project.

mjw1007 · on Nov 4, 2017

4.13 (released this September) added the dm-zoned device-mapper target, which presents an SMR drive as if it was a normal block device.

I suppose what it does is similar to what drive-managed SMR drives do internally.

valarauca1 · on Nov 4, 2017

Yes https://lwn.net/Articles/637035/

loeg · on Nov 4, 2017

Note that the conclusion of that article is that developers are thinking about it, not that it is solved in Linux. Linux may detect SMR drives and query them for zone information, but I don't believe ext4 or any other Linux filesystem works particularly well on them. I would love to be shown otherwise, though.

> Reinecke wondered if the host-managed SMR drives would actually sell. Petersen piled on, noting that the flash-device makers had made lots of requests for extra code to support their devices, but that eventually all of those requests disappeared when those types of devices didn't sell. Reinecke's conclusion was that it may not make a lot sense to try to make an existing filesystem work for host-managed SMR drives.

Edit: wmf points out that F2FS works reasonably well[0], until you have to do a garbage collection pass.

[0]: http://events.linuxfoundation.org/sites/events/files/slides/...

tempestn · on Nov 4, 2017

If you want a reasonably massive HDD for (relatively) cheap that's usable in general purpose applications, the Seagate Ironwolf is starting to look pretty good. Their 10TB drive has been hovering around the same $/TB as an average 4TB lately.

dghughes · on Nov 4, 2017

I'm curious why helium would be used instead of a vacuum. I can imagine a vacuum would be difficult to maintain but would it be any worse than preventing helium from leaking out?

Helium is rare and critical for research and MRI machines it seems wasteful to use it in hard drives.

hedora · on Nov 4, 2017

The heads actually fly over the platter, airplane style. Vacuum would make them crash. The molecules in the atmosphere are too big for the tolerances they need for the head to track properly, so they use Helium.

Note this is a much less wasteful use case than children’s balloons.

Also, modern MRI machines and other research equipment wouldn’t be feasible without massive amounts of storage to back them (though these drives are probably a poor fit for those use cases, since they are soft-realtime, and shingled drives can stall for a long time in the worst case).

ykler · on Nov 4, 2017

Hard drives need a gas because the heads use it to float over the platter. And helium is hardly that rare that it is an issue; it is used in balloons for children.

lamynator · on Nov 4, 2017

Helium used for balloons is low grade helium that has already been recovered from another process. Medical grade helium is virtually pure helium and that is in short supply. Question whether this HDD needs pure or low grade.

mentos · on Nov 4, 2017

My hesitation to a drive this large is that you stand to lose that much more in a failure?

dsr_ · on Nov 4, 2017

You are expected to deploy these in very large RAIDs or similar protection schemes.

They are terrible as single disks, or even in small arrays. Think of them as an alternative to tapes.

Dylan16807 · on Nov 4, 2017

I'd find it pretty catastrophic to lose one terabyte of data. The important part is having backups, not drive size.

known · on Nov 4, 2017

And 14TB RAM in future