Friday, July 9, 2010

RAID is for Disk Bugs

Last month I went on vacation to Calgary, and I shut down Gemini while I was away. Everything seemed to be working fine when I left, so when I got back and turned Gemini on again I was surprised to find the system running really slowly.

My hunch was that a disk had failed and my RAID 5 system was running degraded. I managed to find the RAID management software from Intel and installed it. Sure enough, it confirmed that the array was degraded and that it was trying to rebuild disk 5. I came back hours later to find that the rebuild had not progressed past 3%. The RAID software was reporting 254 Unrecoverable Medium Errors on that drive, so I figured there was something seriously wrong with it.

I replaced disk 5 with a spare disk and started up the system, and sure enough it started rebuilding the disk. Things seemed to be progressing better, but the rebuild was still incredibly slow. After several days it was only 10% complete. Then I started getting more of those Unrecoverable Medium Errors, and not only on the newly replaced disk 5: disk 1 was starting to report the same errors.

After a week had gone by the rebuild was at 81%, and it seemed to be going better, so I was hopeful it would finish the next day. Much to my dismay, the next day I discovered the rebuild had failed because there were no more entries in the bad-block table.

Storage on a disk drive is divided up into 512-byte units called sectors. Every drive has some spare sectors in case there are unrecoverable errors. When that happens, the sector with the error is remapped to one of the spare sectors via the bad-block table. Once the table is full, there is nowhere left to remap a bad sector. Anyway, I ordered a new disk drive to replace the failed drive.
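To make the remapping idea concrete, here is a rough sketch in Python. It is only a toy model: the ToyDrive class, its remap method and the sector_of helper are names I made up for illustration, and a real drive or RAID controller does all of this in firmware with its own table format. The point is just to show why a rebuild can die once the table runs out of entries.

```python
SECTOR_SIZE = 512  # bytes per sector, as described above


class ToyDrive:
    """Hypothetical model of a drive with a fixed pool of spare sectors."""

    def __init__(self, spare_sectors):
        self.free_spares = list(range(spare_sectors))
        self.bad_block_table = {}  # bad sector number -> spare sector index

    def remap(self, sector):
        """Remap a bad sector to a spare; fail when no spares remain."""
        if sector in self.bad_block_table:
            # Already remapped earlier, so reuse the same spare.
            return self.bad_block_table[sector]
        if not self.free_spares:
            raise RuntimeError("no more entries in the bad-block table")
        spare = self.free_spares.pop(0)
        self.bad_block_table[sector] = spare
        return spare


def sector_of(byte_offset):
    """Return the number of the sector that contains a given byte offset."""
    return byte_offset // SECTOR_SIZE
```

With ToyDrive(spare_sectors=2), the first two distinct bad sectors get remapped without complaint, and the third raises the error, which is roughly the situation my rebuild ended up in.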

So now it looks like I have two failed disks, and one disk that is slowly failing. If another drive fails I will be royally hooped. In a RAID 5 system, if one disk fails the system will still run, but if two disks fail the whole file system is destroyed.
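The reason one failure is survivable and two are not comes down to parity. Here is a minimal sketch in Python, assuming a single stripe with plain XOR parity. The function names xor_blocks and rebuild_missing are my own, and a real RAID 5 controller rotates the parity block across the disks and works on much larger blocks, so treat this as an illustration rather than how the Intel controller actually does it.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(stripe):
    """Rebuild the single missing block (marked None) in a stripe.

    The stripe holds the data blocks plus one parity block. XORing all
    the surviving blocks gives back the one that was lost.
    """
    missing = [i for i, block in enumerate(stripe) if block is None]
    if len(missing) != 1:
        raise ValueError("RAID 5 can rebuild exactly one missing block per stripe")
    survivors = [block for block in stripe if block is not None]
    return xor_blocks(survivors)


# Three data disks plus one parity disk (hypothetical contents).
data = [b"abcd", b"efgh", b"ijkl"]
stripe = data + [xor_blocks(data)]

stripe[1] = None                    # one disk fails
recovered = rebuild_missing(stripe)  # XOR of the survivors restores it
```

If a second entry in the stripe is set to None, rebuild_missing raises an error, because one parity block only carries enough information to recover one lost block.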