The Appalling Quality of Hard Disks - Altechnative

I decided to write this article after having spent the last three years fighting unreliable and buggy disks of certain brands. The most prominent anti-star of this article is the disk model HD501LJ – a 500GB SATA disk. If you Google the model number, I am sure you can find out who the manufacturer of it is.

The story begins back in February 2008 when I bought two of these disks for the machine I was building. Approximately 15 months later, both of the disks (used in a RAID1 mirror) failed about 20 minutes apart with massive unrecoverable media failure, taking the data with them. I learned some decades ago about the importance of keeping backups, so it wasn’t that serious a setback – more an annoyance at the waste of time than anything else.

Before I say anything else, I would like to say that the quality of the service provided by Rexo (the company in the UK that handles warranty replacements for this particular disk manufacturer) is superb. They always sent the replacement disk the same day the faulty disk arrived, and as the saga that I am about to recount unfolded, they even sent couriers to pick up faulty disks and deliver the replacements at the same time. Their superb service was in fact the only reason why I bought more disks made by the same manufacturer. This turned out to have been a big mistake, since Rexo no longer handle warranty services directly – you have to arrange for it via the manufacturer’s web site. No doubt they were too efficient and helpful toward the end customers for the manufacturer’s liking.

The real story begins with the disks that arrived as replacements under warranty. One of them worked OK, and passed all the cursory tests I threw at it (short and long SMART tests and a rudimentary pass of badblocks). The other initially passed the tests, but approximately 50% of the time, the actuator would get stuck and the disk would just click indefinitely when trying to power up. Power cycling typically rectified the problem but I wasn’t prepared to put up with it, so it went back to get replaced.

The next replacement disk exhibited a different, interesting problem. It failed the SMART tests immediately, on a sector that was beyond the LBA-addressable range. This turned into a pending sector and couldn’t be fixed because it wasn’t a writable sector – it was a spare sector for remapping unrecoverable sectors. Clearly the firmware is buggy in its handling of such a condition – it doesn’t handle the physical and logical block addressing correctly. Since this rendered the built-in SMART diagnostics useless, this disk went back to be replaced again. Another four disks arrived after it, all with the same issue. This indicates a systematic fault and very poor quality control procedures on refurbished disks.

Eventually my case got escalated to their engineering department, and one of the engineers hand-picked a disk on which the problem wasn’t manifesting, ran a full set of tests on it (including the SMART self-tests, which shockingly do not form a part of the quality control check on disks from this manufacturer – or at least they didn’t form a part of the checks in 2009), and sent me that disk to replace the one that was faulty.

Now, another year later, the problem has manifested again on one of those disks (bad sector at LBA+1 address). The only way to cure both symptoms (refusal to reallocate sectors on overwrite, and bad sectors beyond the end of the LBA-addressable range) was to perform a secure erase. After that, the disks passed the full SMART self-diagnostics and badblocks tests. The HD501LJ and HD103UJ, however, had an additional problem. Once the security on the disks was activated and the password set (required to perform a secure erase), they didn’t automatically disable security upon the erase. It also appears that the security implementation is buggy: if the disk is secured, it will cause the machine to crash during booting on certain combinations of BIOS, SATA controller and motherboard. I worked around this by putting the disks in a machine that didn’t end up crashing and disabling the security on the disks manually.
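For readers who want to see what the secure erase sequence described above looks like in practice, here is a minimal sketch. It assumes a Linux machine with the hdparm utility, which the article itself does not name, so treat the tool choice as an assumption. The function name `secure_erase_commands`, the placeholder device `/dev/sdX` and the throwaway password are all hypothetical. The script only builds and prints the commands; it deliberately does not run them, because a secure erase destroys everything on the disk.

```python
def secure_erase_commands(device, password="temp"):
    """Build (but do not run) an ATA secure erase sequence for hdparm.

    Assumption: the hdparm utility is used; the device path and the
    password are placeholders chosen for illustration only.
    """
    base = ["hdparm", "--user-master", "u"]
    return [
        # 1. Enable security by setting a user password (required before erase).
        base + ["--security-set-pass", password, device],
        # 2. Issue the secure erase itself. This wipes the whole disk.
        base + ["--security-erase", password, device],
        # 3. Normally unnecessary, but on disks that stay locked after the
        #    erase, security has to be disabled by hand.
        base + ["--security-disable", password, device],
    ]


if __name__ == "__main__":
    for command in secure_erase_commands("/dev/sdX"):
        print(" ".join(command))
```

The third command is the manual step that was needed on the HD501LJ and HD103UJ. Running `hdparm -I` on the device before and after should show whether security is still reported as enabled.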

Over time I did a bit more investigative work on these disks, and found additional bugginess in the firmware. I found that unreadable sectors that come up as “pending” sectors in SMART, once written to, disappear rather than show up as reallocated sectors. The pending sector count goes down to 0, and the reallocated sector count stays at 0. This is extremely bad behaviour, and affects not just the HD501LJ model, but also the 1TB HD103UJ disk from the same manufacturer. Since it isn’t limited to a specific model, it seems likely that it affects all disks made by this manufacturer. I should also point out that there is another model of disk from which I have observed the exact same behaviour: the WD5000AAKS. You have been warned – these disks lie about the number of reallocated sectors they have, which means that one of the most important metrics indicating the health of the disk is missing. In some cases these disks also refused to reallocate the sectors on overwrite. It is worth noting that Google’s research on disk reliability shows the sector reallocation count to be the most reliable indicator of a disk’s imminent failure. They found that 40% of the failed disks show reallocated sectors (see Figure 14 in the linked document). It seems reasonable to assume that the disks made by the two manufacturers in question do not track this value in order to reduce warranty claims by letting the problem go unnoticed for longer – no doubt hoping that you won’t notice until just after the warranty period expires. Any manufacturer that does this doesn’t deserve your custom – spend your money instead on a brand of disks that works correctly! Consider yourself warned.
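The disappearing pending sectors can be spotted by comparing two SMART attribute snapshots taken before and after overwriting the bad sectors. The sketch below is one way to do that in Python, assuming the attribute table layout printed by `smartctl -A` and the conventional attribute IDs 5 (Reallocated_Sector_Ct) and 197 (Current_Pending_Sector). The function names and the sample values are hypothetical illustrations, not readings from my disks.

```python
import re

REALLOCATED = 5   # Reallocated_Sector_Ct
PENDING = 197     # Current_Pending_Sector


def parse_attributes(smartctl_output):
    """Return {attribute_id: raw_value} from `smartctl -A` style text."""
    attrs = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the digit check.
        if len(fields) >= 10 and fields[0].isdigit():
            match = re.match(r"\d+", fields[9])
            if match:
                attrs[int(fields[0])] = int(match.group())
    return attrs


def vanished_pending(before, after):
    """Count pending sectors that cleared without a matching reallocation."""
    cleared = max(before.get(PENDING, 0) - after.get(PENDING, 0), 0)
    remapped = max(after.get(REALLOCATED, 0) - before.get(REALLOCATED, 0), 0)
    return max(cleared - remapped, 0)


# Hypothetical snapshots: three pending sectors vanish, none are reallocated.
BEFORE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
"""
AFTER = BEFORE.replace("-       3", "-       0")

if __name__ == "__main__":
    print(vanished_pending(parse_attributes(BEFORE), parse_attributes(AFTER)))
```

A non-zero result means the disk made pending sectors go away without recording a reallocation, which is the behaviour described above. Whether that amounts to a firmware bug or an in-place rewrite is a judgement the script cannot make.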

None of the remedial actions listed above are something an average user would likely be able to carry out, and a more knowledgeable user would get there in the end after spending more time on the task than the cost of a disk would justify. All I can say in conclusion is that unless your data and time are worthless, buy disks made by a more competent manufacturer. Good manufacturers make a bad model from time to time. Bad manufacturers make bad models all the time.

Update: I have recently come across an interesting article on disk failure rates. I cannot help but wonder how much higher the return rates would be on disks from the two manufacturers mentioned above, whose disks I’ve had the misfortune to own, if they weren’t misreporting reallocated sectors.

4 thoughts on “The Appalling Quality of Hard Disks”

Pingback: Enabling Write-Read-Verify Feature on Disks | altechnative.net
Daniel Smedegaard Buus on August 11, 2011 at 11:57 am said:

Hi 🙂

Is it possible that these sectors weren’t reallocated at all, that the sectors failed ECC verification, were rewritten to the same location, reverified and found to be good, thus not needing a sector replacement from the spare pool?

I bought 12 HD501LJ drives from June ‘7, and while they haven’t been stellar, they’ve been fairly stable – currently three of them are RAID5’d in my desktop W7 box. I also have six of the 1TB 103UJ variants from Dec ‘8 that are also still good…

Regardless, I just started attaching some of the drives using a USB adapter, and most of them indeed come up claiming 0 reallocated sectors. (I didn’t check them all, but I have one or two of the 500 GB drives somewhere that I know have a couple of bad sectors that can’t be remapped… Or maybe I didn’t do an LLF…) Of the five 103UJs tested, all report 0 reallocated sectors. Of the four 501LJs, just one reports 2 reallocated sectors; the others claim none. They all report some number of hardware recovered ECCs, though. All of them are rev. A: HD501LJs with firmware CR100-10, HD103UJs with firmware 1AA01113.
gordan on August 11, 2011 at 1:11 pm said:

“Is it possible that these sectors weren’t reallocated at all, that the sectors failed ECC verification, were rewritten to the same location, reverified and found to be good, thus not needing a sector replacement from the spare pool?”

It is plausible, but it is also wrong. If a part of the media has exhibited a tendency to lose its charge, then re-writing it is not a solution, since if the media is dodgy – and it has already demonstrated itself as such – it will happen again. I would also question whether the disk even checks whether the overwrite stuck, since it doesn’t support the Write-Read-Verify feature. If a sector has lost its data beyond recovery, then that sector should not be re-used. No reputable manufacturer of quality products would be doing that in their disks.
gweihir on August 27, 2011 at 12:11 pm said:

I have seen the realloc being zero despite having had pending sectors before as well on some disks. I believe this is correct behavior. You can damage sectors by vibration during write. (E.g. shout at the disk, need to try this some time…) These sectors will be perfectly fine after retest on write. The way to catch this is to run SMART monitoring software that sends email (or other alerts) on finding pending sectors.

I also noticed in at least one case that the “196 Reallocated_Event_Count” did correctly list the former pending sectors. Apparently it does count how often a sector was considered for reallocation even if it then was not. So I would say these disks are not actually lying, just the standard is poorly designed. As so often.

That does not invalidate your observation about the shoddy firmware quality and shoddy recertification process, of course. Although I have to say I have several drives from the unnamed manufacturers too and never had any unreasonable problems. The one drive that died on me with massive reallocations had been dropped before, and I still got all my data off it.
