Your statistics are a bit off. If 1 cycle has a 33.3% chance of error, then 3 cycles have got a 70% chance of error, not ~100%... So this is obviously misleading.
How so? 1 cycle reaches 1/3 of the number of bits written that they give in their data sheet. Unless drives are built so they work 100% reliably up to some limit and then suddenly explode, it seems reasonable to expect that getting 1/3 of the way to the "target number" gives a 1/3 chance of failure.
3 cycles, on the other hand, reach (actually a little more than) the number of bits their data sheet gives as the target for "1 unrecoverable sector". Which, according to what's written, means that 1 sector will be unrecoverable. Or something else, depending on what the definition means exactly.
But reaching the "target" (whatever it is) is 100%, not 70%. No?
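To put numbers on the disagreement, here's a quick sketch in Python. These are toy models with my own assumptions, nothing from the data sheet: a "linear" reading, the independent-cycles reading above, and a per-bit reading where every bit independently has a 1-in-N chance (N being the data-sheet figure).

    import math

    # Toy models only (my own assumptions, nothing from the data sheet).
    # N = the data-sheet bit count for "1 unrecoverable sector"; one cycle
    # writes roughly N/3 bits.
    for cycles in (1, 2, 3):
        fraction    = cycles / 3.0                   # fraction of N written so far
        linear      = min(fraction, 1.0)             # "1/3 of the way = 1/3 chance", 100% at N
        independent = 1 - (1 - 1/3.0) ** cycles      # each cycle an independent 1/3 chance
        per_bit     = 1 - math.exp(-fraction)        # every bit an independent 1/N chance (approx.)
        print(f"{cycles} cycle(s): linear {linear:.1%}, "
              f"independent cycles {independent:.1%}, per bit {per_bit:.1%}")

So depending on which model you believe, 3 cycles mean anything from ~63% over ~70% to "100%", which is exactly why the definition matters.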
Does a "1 per 1014" rate mean that each individual read has a 1/1014 chance of error? Does it mean that the mean number of reads before 1 error occurs is 1014? That after 1014 reads, there's a 50/50 chance that there's been an error or not?
Good question. Who knows what the definition is, they don't tell.
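And it does matter: even under the most naive reading (every bit independently has a 1/10^14 chance of being unrecoverable, which is purely my assumption here), the three interpretations above give three different numbers. A quick sketch:

    import math

    # Toy model only: every bit independently unrecoverable with p = 1e-14.
    p = 1e-14
    mean_bits   = 1 / p                      # expected bits until the first error: 1e14
    p_by_mean   = 1 - math.exp(-p * 1e14)    # chance of >=1 error within 1e14 bits: ~63.2%
    median_bits = math.log(2) / p            # the actual 50/50 point: ~0.69e14 bits
    print(f"mean {mean_bits:.2e}, P(error within 1e14 bits) {p_by_mean:.1%}, median {median_bits:.2e}")

Mean of 10^14 bits to the first error, but only a ~63% chance of having seen one by then, and the 50/50 point sits at about 0.69*10^14 bits.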
I would read it as "after writing 10^14 bits, it's allowed to have at most one unrecoverable sector". This interpretation comes from my general expectation that a harddisk has zero failures (unless you hit it with a hammer or submerge it in water while powered on).
Of course that won't happen in practice. Cosmic radiation, radioactive decay, your cat walking close by, name whatever you like... a bit may always flip for no apparent reason. Both in RAM and on magnetic storage. Rarely, but it happens. For that, drives have error correction.
I have a fully functional 2-year-old Samsung drive (if you can ever consider a Samsung drive fully functional) which reports 1924915 Hardware_ECC_Recovered events. This is considered 100/100 ("perfectly good") as the user-displayed value, and indeed the drive has never actually shown any kind of failure. It has a reallocation count of 0 and a pending count of 5 (these values didn't change over the last 3 months, so apparently the controller isn't yet sure whether or not to remap those 5 sectors; they're probably the ones causing the ECC_Recovered events).
So I guess (but who knows!) that the numbers they provide are something they guarantee (or rather specify, you don't really have a guarantee) as an upper bound. Though, of course, you don't know when random failure happens. The harddisk might not even spin up when you power it up for the first time; that's unlikely but possible (and it would be a one-billion-sector failure with zero bits written...). So that "upper bound" would probably have to be seen as something like the boundary of, e.g., the 99% confidence interval?
Or... something else? Maybe it's a completely made up bogus number that just looks very technical for marketing?
How do you even measure this in a somewhat reliable way? You'd have to test hundreds of drives until they produce some number of random failures (say, a dozen), count them exactly, and divide the failures by the total number of bits written to get a rate. That would be immensely time-consuming and expensive.
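Just to put a scale on it (all figures below are my own assumptions: a dozen observed failures, 100 drives, 100 MB/s sustained, and a true rate exactly at the spec):

    # Back-of-envelope for such a test (all figures assumed).
    target_rate   = 1e-14            # 1 unrecoverable sector per 1e14 bits
    wanted_errors = 12
    drives        = 100
    bit_rate      = 100e6 * 8        # 100 MB/s sustained, in bits per second

    bits_needed = wanted_errors / target_rate            # 1.2e15 bits, ~150 TB
    wall_hours  = bits_needed / (drives * bit_rate) / 3600
    drive_days  = bits_needed / bit_rate / 86400
    print(f"{bits_needed:.1e} bits, {wall_hours:.1f} h wall clock, {drive_days:.0f} drive-days")
    # ...and that ignores reading everything back to even notice the bad
    # sectors, and assumes the spec'd rate is the true one.

That's around 150 TB of sustained I/O across a fleet of drives, and even then a dozen events leaves you with roughly ±30% statistical uncertainty.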
Maybe, after all, they did such a measurement in 1980 which said 10^10, and since disks are bigger nowadays and technology advances, they simply "extrapolated" it to 10^14. I wouldn't know, and there's hardly a way one could verify this.
e.g. if each read operation has a 1/10^14 chance of an error
In reality, each read operation has a much higher chance of an error, that's not the same thing, however. The drive will apply error correction and retry (again, with error correction) before reporting an error. And you might be lucky trying again next time.
It's not the same as a sector going bad over time, either. Drives will regularly reallocate sectors (copying the data elsewhere) when the FEC is triggered more often than they like, or when some other metric tells the controller that the signal-to-noise ratio in one location isn't that great. This is a "normal" thing.
They're saying unrecoverable sector, which means that no matter what the drive does and no matter how often you try, you're not getting your data back. You wrote data to disk, the drive reported "OK, you're good to go", and now the data is gone. Forever. As if you invested in Lehman's.
Now, inevitably, you're going to say "this is what backups are for". What can I say but "they are, and they are not". Not only do you not know whether your backup itself contains an unrecoverable sector, but the idea of a backup is not to plan on losing data, it's to be prepared if it happens (note the wording: if, not when).
4 trillion ops is a lot of ops
Not such a lot for a harddisk, though.
My Windows system disk has an average of 6.75 million writes per hour (dividing S.M.A.R.T. Lifetime_Writes by Power_On_Hours). Since it is an SSD, I am very careful not to use the system disk for volatile data (I'd hate having to set up Windows again because of disk failure). I don't install anything I don't really need, and I don't copy anything to the system disk that doesn't absolutely need to be there.
Swap and temp are on a ramdisk (but Windows doesn't give a fuck, it still writes half of its temp files to C:\Windows\Temp), and data as well as programs that I update often (e.g. GCC) go onto the second disk (ironically also an SSD, but it would be less painful to replace than the system disk, obviously). Bulk data and things like downloads are hosted on the NAS, and that's where tasks like unpacking zip files and such happen, too (funny to download from the internet only to store it on the network again, but heh).
Ironically, although the second drive is "used" a lot more than the first, it has fewer write operations than the system drive (only 36,000 per hour). The WD disk in the NAS doesn't provide a lifetime writes figure, so I can't tell how many writes it sees per hour.
I have disabled all services such as indexing, superfetch, .NET optimization and every other shit that constantly accesses the disk for no good reason. No software on the system that is not necessary. Still, Windows manages to generate 600-900 writes per minute when you have no program open and walk away, i.e. not even touching the mouse. If you have a program like Firefox open (but not doing anything!), the number of writes approximately doubles (presumably because it syncs its settings to disk every few seconds; all those writes go to the user profile).
Now, 4 trillion is about 592,000 times 6.75 million, so you might say "ridiculous, never going to happen", but it really isn't. Yes, that's 592,000 hours, or 67 years, before failure, but this number is based on the wrong figures. First, those are "writes", not "sectors written". Some of these writes will certainly have been larger than one sector, but in any case they were much larger than 1 bit (the numbers are relative to bits written, remember, and a device doesn't write less than one full sector!). In the case of the Seagate drive this thread is about, sectors are 4096 bytes, so each write covers at least 32,768 bits, and it's more like 18 hours, not 67 years.
Also, if I hadn't disabled most Windows stuff, and if swap and temp didn't go onto the ramdisk, and so on, I'd have anywhere from 50 to 100 times more writes than I have now (which is very optimistic; a single build alone generates several thousand temporary files, each with a dozen or so writes, so it might as well be 1000 times more). And suddenly those assumed 67 years (let's assume that although they say "bits", they really meant "sectors") are something between 8 months and 1 1/2 years, which seems very close and tangible.
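For what it's worth, here's the whole back-of-envelope in one place (the 6.75 million writes/hour, the 4096-byte sectors and the 50-100x factor are the figures from above; the rounding is mine):

    # The arithmetic from the last few paragraphs in one place.
    spec_ops        = 4e12        # the "4 trillion ops" figure
    writes_per_hour = 6.75e6      # my system disk's SMART average

    hours = spec_ops / writes_per_hour
    print(f"naive: {hours:,.0f} hours = {hours / 8766:.0f} years")           # ~592,000 h, ~67 years

    # The spec counts bits, and a write is at least one 4096-byte sector:
    bits_per_write = 4096 * 8                                                # 32,768 bits minimum
    print(f"counting bits, not writes: {hours / bits_per_write:.0f} hours")  # ~18 hours

    # With stock Windows settings (swap, temp, indexing etc. on disk),
    # assume 50x to 100x the write volume and count sectors, not bits:
    for factor in (50, 100):
        print(f"{factor}x writes: {hours / 8766 / factor:.1f} years")        # ~1.4 and ~0.7 years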