Enabling Write-Read-Verify Feature on Disks

Given the appalling reliability of modern disks, any feature that helps ensure data integrity and early detection of failure has to be deemed a good thing. What caught my attention recently is that all of the Seagate Barracuda disks I have (a number of ST31000333AS, ST31000340AS and ST31000528AS models) support the Write-Read-Verify feature. But there is a snag – disks from different batches, even of the same model, seem to disagree about the default state of this feature. Worse, the feature gets reset to its default setting on every reboot. This wouldn’t be a problem if the usual tool for such things on Linux, hdparm, had an option for controlling the state of this feature – but it doesn’t. So I wrote a patch that adds control of the write-read-verify capability to hdparm. Hopefully this will help keep your data a little safer.
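To make the "supported but not necessarily enabled" distinction concrete, here is a minimal Python sketch that decodes the two relevant bits from a drive's IDENTIFY DEVICE data. It is not the hdparm patch itself. The word and bit positions (word 119 bit 1 for "supported", word 120 bit 1 for "enabled") are my reading of the ATA8-ACS layout and should be checked against the specification, and the function name `wrv_status` is made up for illustration.

```python
# Assumed positions in the 256-word IDENTIFY DEVICE block (per my reading
# of ATA8-ACS; verify against the spec before relying on them).
WRV_SUPPORTED_WORD = 119  # feature sets supported
WRV_ENABLED_WORD = 120    # feature sets enabled
WRV_BIT = 1 << 1          # bit 1 = Write-Read-Verify


def wrv_status(identify_words):
    """Return (supported, enabled) for the Write-Read-Verify feature.

    identify_words is a sequence of 256 integers, one per 16-bit word of
    the IDENTIFY DEVICE data.
    """
    if len(identify_words) != 256:
        raise ValueError("expected 256 IDENTIFY DEVICE words")
    supported = bool(identify_words[WRV_SUPPORTED_WORD] & WRV_BIT)
    # A feature that is not supported cannot meaningfully be enabled.
    enabled = supported and bool(identify_words[WRV_ENABLED_WORD] & WRV_BIT)
    return supported, enabled
```

The words themselves have to come from the drive; as far as I know, hdparm's --Istdout option dumps them in hex, which could be parsed into the list this function expects. Running the check after each boot would show whether a given disk came up with the feature on or off.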

The Appalling Quality of Hard Disks

I decided to write this article after having spent the last three years fighting unreliable and buggy disks of certain brands. The most prominent anti-star of this article is the disk model HD501LJ – a 500GB SATA disk. If you Google the model number, I am sure you can find out who manufactures it.

The story begins back in February 2008 when I bought two of these disks for the machine I was building. Approximately 15 months later, both of the disks (used in a RAID1 mirror) failed about 20 minutes apart with massive unrecoverable media failure, taking the data with them. I learned some decades ago about the importance of keeping backups, so it wasn’t that serious a setback – more an annoyance at the waste of time than anything else.

Before I say anything else, I would like to say that the quality of the service provided by Rexo (the company in the UK that handled warranty replacements for this particular disk manufacturer) was superb. They always sent the replacement disk the same day the faulty disks arrived, and as the saga that I am about to recount unfolded, they even sent couriers to pick up faulty disks and deliver the replacements at the same time. Their superb service was in fact the only reason why I bought more disks made by the same manufacturer. This turned out to have been a big mistake, since Rexo no longer handle warranty services directly – you have to arrange for it via the manufacturer’s web site. No doubt they were too efficient and helpful toward the end customers for the manufacturer’s liking.

The real story begins with the disks that arrived as replacements under warranty. One of them worked OK, and passed all the cursory tests I threw at it (short and long SMART tests and a rudimentary pass of badblocks). The other initially passed the tests, but approximately 50% of the time, the actuator would get stuck and the disk would just click indefinitely when trying to power up. Power cycling typically rectified the problem but I wasn’t prepared to put up with it, so it went back to get replaced.

The next replacement disk exhibited a different, interesting problem. It failed the SMART tests immediately, on a sector that was beyond the LBA addressable range. This turned into a pending sector and couldn’t be fixed because it wasn’t a writable sector – it was a spare sector for remapping unrecoverable sectors. Clearly the firmware is buggy in its handling of such a condition – it doesn’t handle the physical and logical block addressing correctly. Since this rendered the built-in SMART diagnostics useless, this disk went back to be replaced again. Another four disks arrived after it, all with the same issue. This indicates a systematic fault and very poor quality control procedures on refurbished disks.

Eventually my case got escalated to their engineering department, and one of the engineers hand-picked a disk on which the problem wasn’t manifesting, ran a full set of tests on it (including the SMART tests, which shockingly do not form a part of the quality control check on disks from this manufacturer – or at least they didn’t in 2009), and sent me that disk to replace the one that was faulty.

Now, another year later, the problem has manifested again on one of those disks (a bad sector at the LBA+1 address). The only cure for both symptoms (refusal to reallocate sectors on overwrite, and bad sectors beyond the end of the LBA addressable range) was to perform a secure erase. After that, the disks passed the full SMART self-diagnostics and badblocks tests. The HD501LJ and HD103UJ, however, had an additional problem. Once security was activated on the disks and the password set (which is required to perform a secure erase), they didn’t automatically disable security after the erase. It also appears that the security implementation is buggy: if the disk is secured, it will cause the machine to crash during booting on certain combinations of BIOS, SATA controller and motherboard. I worked around this by putting the disks in a machine that didn’t crash and disabling the security on the disks manually.

Over time I did a bit more investigative work on these disks, and found additional bugginess in the firmware. I found that unreadable sectors that come up as “pending” sectors in SMART, once written to, disappear rather than show up as reallocated sectors. The pending sector count goes down to 0, and the reallocated sector count stays at 0. This is extremely bad behaviour, and affects not just the HD501LJ model, but also the 1TB HD103UJ disk from the same manufacturer. Since it isn’t limited to a specific model, it seems likely that it affects all disks made by this manufacturer. I should also point out that there is another disk model from which I have observed the exact same behaviour: WD5000AAKS. You have been warned – these disks lie about the number of reallocated sectors they have, which means that one of the most important metrics indicating the health of the disk is missing. In some cases these disks also refused to reallocate the sectors on overwrite.

It is worth noting that Google’s research on disk reliability shows the sector reallocation count to be the most reliable indicator of a disk’s imminent failure. They found that 40% of the failed disks show reallocated sectors (see Figure 14 in the linked document). It seems reasonable to assume that the reason the disks made by the two manufacturers in question do not track this value is to reduce their warranty claims by making the problem remain unnoticed for longer – no doubt hoping that you won’t notice until just after the warranty period expires. Any manufacturer that does this doesn’t deserve your custom – spend your money instead on a brand of disks that works correctly! Consider yourself warned.
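The vanishing pending sectors described above can be spotted by comparing the SMART counters before and after overwriting the suspect sectors. The following Python sketch parses the attribute table printed by `smartctl -A` and flags the case where the pending count drops while the reallocated count does not move. The helper names and the two sample tables are made up for illustration (they are not captured from my disks), and the parser assumes the usual ten-column attribute layout with the raw value in the last column.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def parse_smart_raw(smartctl_output, names=WATCHED):
    """Extract raw values of the named attributes from `smartctl -A` text."""
    counters = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns:
        # ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in names:
            counters[fields[1]] = int(fields[9])
    return counters


def vanished_without_reallocation(before, after):
    """True if pending sectors went away but none were recorded as reallocated."""
    pending_drop = before["Current_Pending_Sector"] - after["Current_Pending_Sector"]
    realloc_gain = after["Reallocated_Sector_Ct"] - before["Reallocated_Sector_Ct"]
    return pending_drop > 0 and realloc_gain == 0


# Illustrative (invented) samples: three pending sectors before the overwrite,
# none afterwards, and the reallocated count still at zero.
BEFORE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
"""
AFTER = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    print(vanished_without_reallocation(parse_smart_raw(BEFORE), parse_smart_raw(AFTER)))
```

In practice the two text blocks would be saved from `smartctl -A` runs on the same disk, one before and one after rewriting the affected sectors.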

None of the remedial actions listed above is something an average user would likely be able to carry out, and a more knowledgeable user would get there only after spending more time on the task than the cost of a disk would justify. All I can say in conclusion is that unless your data and time are worthless, buy disks made by a more competent manufacturer. Good manufacturers make a bad model from time to time. Bad manufacturers make bad models all the time.

Update: I have recently come across an interesting article on disk failure rates. I cannot help but wonder how much higher the return rates would be on disks from the two manufacturers mentioned above, whose disks I’ve had the misfortune to own, if they weren’t misreporting reallocated sectors.