Wednesday, August 1, 2012

Keeping Cool

For quite some time I have been having a problem with one of my two CPUs running too hot. Last weekend I seem to have fixed the problem, and my second CPU is running much cooler now.

When I first assembled Gemini I used a water cooling system from CoolIT in Calgary, Alberta. When I attached the water blocks to the CPUs I used Arctic Silver Ceramique 2 as the thermal bonding compound and followed the instructions precisely. I know I did a good job because when I had to send the Intel motherboard back for replacement, after I took off the water blocks the pattern of the thermal compound was textbook perfect. When I got the replacement motherboard I reattached the water blocks as before with Ceramique 2 with confidence.

Eventually the bearings in the water pump started to fail, so I sent my system back to CoolIT in Calgary and they replaced the cooling system with one of their newer models. I was a little concerned at the time because when I installed the first cooling system it came with a thick coating of thermal compound slathered on the water blocks, which I had to remove thoroughly before using the Ceramique 2.

About a year ago I noticed that one of the CPUs was running too hot and did all kinds of investigation and research to understand the problem. In fact I thought the CPU was damaged because it was running so hot and drawing way more power than the other CPU.

Last weekend I took off the water block, and sure enough there was a thick layer of gray thermal compound. I cleaned it all off and carefully applied Ceramique 2 then attached the water block.

Right after powering on my computer I could see that the CPU was being cooled better than before. However, there is a 27 hour burn-in for the Ceramique 2 - that is, it takes a while for the compound to spread out properly and stabilize. The next day the CPU thermal profile looked even better, and after 60 hours it looks better still. In fact the CPU thermal profile looks pretty much normal, as it should. Sure, the second CPU runs a little hotter than the first, but that is because it is downstream in the water cooling system. Also, the first CPU is running a little hotter than before because the second CPU is dumping more heat into the cooling system, which is what you would expect if the second CPU is being cooled properly.

The lesson from all this is that when attaching heat sinks to your CPUs, the choice of thermal bonding compound and how you apply it is critical. The second lesson is that just because a company is expert at the mechanics of cooling systems does not mean they cannot also be complete idiots with respect to simple things like thermal bonding compounds and how to apply them. Sometimes expertise in one area can cause blindness in other areas.

Tuesday, July 17, 2012

The Long Road

When a man sets forth to carry a cat by its tail, he learns something valuable that will never grow dim or doubtful - Mark Twain

For the last few months I have been trying to repair the RAID-5 system on Gemini. It did not fail completely, but a couple of disks developed unrecoverable media errors at the same time. Much to my surprise what this means is that even if I replace a failing disk with a perfectly good new disk, the RAID system will not be able to successfully rebuild it because of media errors on a different disk.
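To see why a second media error is fatal to a RAID 5 rebuild, here is a minimal sketch of single-parity reconstruction. It is a toy model, not the Intel or LSI implementation: the block contents, the five-disk stripe, and the helper names `xor_blocks` and `rebuild_block` are all made up for illustration. The point is that rebuilding the missing block needs every other block in the stripe to be readable.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_block(stripe, missing):
    """Rebuild stripe[missing] from every other block in the stripe.

    A block of None stands for an unreadable sector (a media error).
    """
    survivors = [blk for i, blk in enumerate(stripe) if i != missing]
    if any(blk is None for blk in survivors):
        raise IOError("media error on a surviving disk: stripe cannot be rebuilt")
    return xor_blocks(survivors)

# Hypothetical stripe: four data blocks plus one parity block on five disks.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stripe = data + [xor_blocks(data)]

# Disk 2 is replaced with a blank drive; the other four disks are healthy.
healthy = list(stripe)
healthy[2] = None
recovered = rebuild_block(healthy, missing=2)

# Same replacement, but disk 0 also has an unreadable sector in this stripe.
damaged = list(healthy)
damaged[0] = None
try:
    rebuild_block(damaged, missing=2)
    second_error_is_fatal = False
except IOError:
    second_error_is_fatal = True
```

With only one block missing, the XOR of the survivors gives the lost data back. With two blocks missing in the same stripe there is simply not enough information left, no matter how good the replacement disk is.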

Obviously the people who wrote the RAID software are complete morons.

It took me several tries and ultimately a phone call to Intel Technical Support to figure this out. All they could suggest was to back up my files and reinitialize the RAID configuration. Thanks for nothing.

As I was going to have to reinitialize my RAID configuration I decided to get a real RAID Controller rather than rely on the Intel Software RAID implemented in the BMC (Baseboard Management Controller) of the system board. I have generally had nothing but problems with it from day one.

I ultimately decided on the LSI 9266-8i with cache backup. This is a high end SAS (Serial Attached SCSI) controller - so it can support both SAS and SATA (Serial ATA) storage devices. It has a 1 GB RAM cache to speed up operations, and an onboard flash memory with a bank of capacitors. In the event of a power failure, the capacitors will power the controller long enough to copy the cache to persistent flash memory. The next time the system powers on it will automatically restore the cache so you don't lose any data. The advantage of capacitors over batteries is they are more robust, last longer, and require less maintenance.

While the 9266 is an 8-lane PCI Express 2.0 device, sadly my system only has a 4-lane PCI Express 1.0 slot available. The card seems to run fine in this slot, but it just means there will not be as much I/O bandwidth available. While there will likely not be that kind of bandwidth to disk, the 9266 has a 1 GB RAM cache that is very fast.
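As a rough back-of-envelope for the slot mismatch, the sketch below uses the standard usable per-lane figures for PCI Express 1.0 and 2.0 (250 MB/s and 500 MB/s per direction after 8b/10b encoding). These are theoretical link ceilings, not measured throughput, and the function name is just for illustration.

```python
# Usable bandwidth per lane, per direction, after 8b/10b encoding overhead.
MB_PER_S_PER_LANE = {"1.0": 250, "2.0": 500}

def link_bandwidth_mb_s(generation, lanes):
    """Theoretical one-way bandwidth of a PCI Express link in MB/s."""
    return MB_PER_S_PER_LANE[generation] * lanes

card_native = link_bandwidth_mb_s("2.0", 8)     # what the card is designed for
available_slot = link_bandwidth_mb_s("1.0", 4)  # the 4-lane 1.0 slot

print(f"x8 PCIe 2.0: {card_native} MB/s")
print(f"x4 PCIe 1.0: {available_slot} MB/s")
print(f"ratio: {card_native // available_slot}x")
```

So the slot offers about a quarter of the link bandwidth the card could use, which mostly matters for transfers served from the controller's cache rather than from the disks themselves.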

Because the 9266 can support 8 disks I decided to purchase a couple of extra mounting brackets and two more enterprise class 2 TB disks. Consequently I will have an 8 disk array, instead of a 5 disk array, and a spare replacement disk. The 9266 can actually support up to 256 SAS storage devices if you daisy chain them together, but I have nowhere near that much room in my computer case.

I think I have learned my lesson with RAID 5 so I will use RAID 6 instead. While the write performance will be less than RAID 5, the cache on the 9266 seems to make a big difference so my performance will be better overall than using the on-board RAID 5. With RAID 6 you can lose two storage devices in the array without failure. I now know from first hand experience how likely it is to lose two devices at the same time.
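The trade-off between the two schemes can be put in numbers with a small sketch. It assumes eight identical 2 TB disks and ignores controller metadata and formatting overhead; `raid_summary` is a made-up helper, not part of any RAID tool.

```python
def raid_summary(level, disks, disk_tb):
    """Usable capacity and failures tolerated for RAID 5 or RAID 6."""
    parity_disks = {5: 1, 6: 2}[level]
    if disks < parity_disks + 2:
        raise ValueError(f"RAID {level} needs at least {parity_disks + 2} disks")
    return {
        "usable_tb": (disks - parity_disks) * disk_tb,
        "failures_tolerated": parity_disks,
    }

raid5 = raid_summary(5, disks=8, disk_tb=2)
raid6 = raid_summary(6, disks=8, disk_tb=2)
print("RAID 5:", raid5)
print("RAID 6:", raid6)
```

Under those assumptions RAID 6 gives up one more disk's worth of space than RAID 5 in exchange for surviving a second failure during a rebuild.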

It is interesting to note that the Intel Embedded Server RAID Technology II (ESRT2) used on the S5520SC is based on LSI technology, so the LSI RAID Management Software can manage both systems at the same time. Sadly, the 9266 has the same frustrating limitation that the ESRT2 does - namely that you cannot create virtual disks on the same array (disk group) with different RAID schemes. You can do this on other RAID systems, however, such as those with Intel's Matrix Storage Manager. For example with MSM, you can create a RAID 0 virtual disk (for a fast system drive) and a RAID 5 virtual disk (for a reliable data drive).

When I finally had all the pieces together to migrate everything from the old RAID to the new RAID, I then began the week from hell. Generally this should be simple.
  1. Back up all the stuff you want to keep.
  2. Reconfigure the RAID.
  3. Restore all the stuff you saved.
Sadly I had nothing but problems doing this. Part of the problem was I needed to move the disks from the old RAID to the new RAID, so I could not do a direct transfer. Also, I needed a way to boot my system independent of any of the RAID systems.
  • The first solution I thought of was to install Windows 8 to a USB Thumb Drive. This is a new feature Microsoft has added to Windows 8, and I found a nice article on a simple way to do this.
  • The first time I tried this it worked fairly well and I was able to boot Windows 8 from the new high-speed ADATA thumb drive I just purchased for this purpose. While running Windows 8 from here was fairly effective, it was still slow as thumb drives are not all that fast. Everything was going along fine, I was able to install the RAID management software and Acronis True Image there. I was almost finished doing all the backups when my system rebooted for mysterious reasons, before my backup was finished.
  • After restarting my system I was no longer able to boot from the ADATA drive, it complained there were no boot records. I tried to repair the boot records, but still no luck. I tried reinstalling Windows, but I still could not boot from it - boot-block problems still. Also, half the time or more, my system did not recognize the ADATA as a boot device. After an insane number of hours playing this stupid game I decided that because the ADATA is actually a USB 3.0 thumb drive, my Intel S5520SC does not play well with it. Consequently, this is not a reliable solution.
  • Then I tried installing Windows 8 to my Patriot thumb drive. While this is a USB 2.0 device, it is also much slower. When I tried to boot Windows 8 the setup took almost 10 times longer than with the ADATA, and many things did not work well because all sorts of Windows processes would time-out. Windows is exceptionally poor when running on/with slow or problematic file systems.
  • While I was still deciding on a boot solution, I tried backing up my old RAID volumes again, so I decided to install Acronis True Image on my main system. Before the installer finished my system 'blue screened' and after restarting it would not boot. I had to go into Windows system repair which started a 'system restore' - I forgot how insanely long a system restore is, over 3 hours. In the end there was a message that the system restore failed. I tried rebooting and it worked. However, my system was extremely unstable and I was having problems with all my storage devices.
  • I got on the phone with Acronis technical support explaining the problem. They had to install a special program to uninstall their software because the regular installer would not work. Then they had to muck around in the registry to remove all the crap their installer put there. Finally they downloaded a newer build of their installer and told me to try that after rebooting. When I tried the new installer, it got a little further, but again blue screened my system. Clearly a pattern here. I had to repeat the software removal experience with technical support yet again, but after that my system was still not stable.
  • After getting my system semi stable again, I tried a variety of other backup solutions, but had trouble with most of them for one reason or another.
  • etc, etc, etc
Here is the solution that ultimately worked, but it was not without its own problems.
  • I simply gave up expecting Windows to boot and run from any USB device, whether flash or rotating disk, none were reliable enough.
    • I wasted many days of time trying to create a stable, bootable USB drive.
    • I decided I could bootstrap the process by installing Windows 8 on the new array and then using that to restore the Gemini system images.
  • Ultimately I found that DriveImage XML was the best product for backing up drive images to external USB devices. In particular I backed up both the Windows System (C:) and System Reserved partitions using DriveImage XML. For the Data (D:) partition I simply used a normal Windows file copy to an external drive.
    • I wasted a lot of time because I failed to read the warning that you should run DriveImage XML as Administrator.
    • But this is what happens when you are tired and stressed out, you can easily miss subtle things like this.
    • While I have used imagex many times in the past, I could not get it to back up Gemini because it kept complaining about "a virus or other unwanted software." Typical Microsoft arrogance, they always believe they know how to do your tasks better than you do.
  • I tested reinstalling my system on a partial RAID I had constructed on the new controller. Once I was convinced I could restore a system I disassembled the old RAID system and moved all the disks to the LSI 9266 for a total of 8 disks in a RAID 6 configuration with 4 virtual disks.
    • Dress rehearsals are always important, and I am most proud of the fact I had the forethought to do this before committing myself to a radical system change. Measure twice - cut once ;-)
  • I installed Windows 8 on virtual disk #0. This is basically my (maintenance) system. I then used DriveImage XML from here to restore the system partitions to virtual disk #2, and normal Windows file copy to restore the data partition to virtual disk #3.
    • A good lesson is whenever you build a high end server like Gemini, keep a few extra virtual disks around for maintenance and experimentation.
    • While I had installed Windows 8 successfully 6 times or more, suddenly it refused to install with some stupid message about not finding or being able to create partitions. Finally I had to install Windows 7 first, and then install Windows 8 over it. The most important thing about Microsoft products is that you can NEVER expect them to behave the same way more than once under the same conditions.
  • Finally I used the Windows Setup DVD to copy the boot blocks to virtual disk #2, reconfigured the LSI 9266 to use virtual disk #2 as the boot drive, and booted the newly migrated system for Gemini.
Things seem to be running smoothly now, and Gemini is much faster than before with respect to file systems. Windows is now behaving much better and most of the annoying long pauses seem to have gone away.

Lessons Learned

  1. Windows behaves extremely poorly with a slow or failing file system. In fact, it is completely unbelievable how much a slow or failing file system can cause Windows and its desktop applications like Explorer to freeze for long indeterminate amounts of time.
  2. Always have a bootable backup disk on hand. Avoid Windows PE and RE, and have a full version of Windows available. Install really useful recovery and other maintenance tools.
  3. Modern computer systems have shockingly few ways to boot, especially from external devices. While you can generally use network boot, it is insanely difficult to set up and configure. Forget about booting from FireWire, or USB 3.0. Forget about booting a complex USB 2.0 device with a bridge or disk multiplexor - most, if not all, BIOSes are incredibly simple and stupid. EFI systems might be a bit better, but the EFI support on my S5520SC system board is really messed up. While you can install Windows on a USB Thumb Drive, most are too slow - see point 1.
  4. Windows 8 is better than Windows 7 at self configuring.
  5. Failure is not an option! If at first you don't succeed, take a deep breath, swear, scream, pull your hair out, stamp your feet, have a few drinks, but just keep going. Winston Churchill said "If you are going through hell, just keep going."

Thursday, July 5, 2012

Serious Upgrades

PCI Express Flash Drive

Years ago when I originally conceived Gemini my plan was to use one of the FusionIO devices for my main boot and system drive. While FusionIO kept promising they would release a bootable device, after a couple of years they finally gave up promising and declared that none of their customers really had a need for such a device.

While OCZ has had a similar device for some time that is bootable, when I studied how it operated it never sounded right.

Recently Intel released its Ramsdale series under the Intel 910 product line, and it seemed to be what I was waiting for so I ordered the 800 GB model for $4,000. It was easy enough to install in my computer, but when I finally got it working it appeared as four 200 GB devices. Try as I might, I could find no way to install an operating system on it either.

Finally I started searching the web again for more information and found a few good reviews that pointed out it indeed was not bootable, and it indeed looked like 4 separate devices, with no built-in RAID management. What is surprising is that while each review made these facts clear as day and easy to find, none of the Intel documentation makes any of that clear.

It was easy enough to configure the 4 devices into a single 800 GB Software RAID 0 drive using Windows, but still you cannot boot from a Software RAID. Also, what was really odd according to some of the reviews, the device does not perform any better in RAID 0 operation than in direct operation. Now that is really strange. I wonder how Intel screwed that up?

One good thing is that the new device is pretty fast. Using VMWare Workstation 8 I created a Virtual Disk on the Flash Drive and installed Windows 8. It only takes 11 seconds for Windows 8 to boot from a cold start. By comparison, using a RAM Disk to host the Virtual Disk, I can get Windows 8 to boot in 9 seconds.

Real RAID

I have had so many problems with the built-in RAID on my Intel S5520SC system board that I decided to go buy a high end PCI Express RAID Controller - the LSI MegaRAID SAS 9266-8i.

I also got the Flash backup option. Basically this is a daughter card with Flash memory and a bank of capacitors. The controller has a huge 1 GB RAM cache to improve performance, but if the power fails your file system can be royally hosed. With the flash backup, the capacitors provide power to the controller long enough to copy the RAM cache to flash. When the power comes back, the RAM cache is restored from flash and all the outstanding I/O operations can be completed so that your file system stays sane.
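The difference the flash backup makes can be shown with a toy model of a write-back cache. This is only a sketch of the idea described above, not the 9266 firmware: the class, its methods, and the dictionaries standing in for RAM, flash, and disk are all invented for illustration.

```python
class CachedController:
    """Toy write-back cache with an optional flash backup module."""

    def __init__(self, flash_backup):
        self.flash_backup = flash_backup
        self.disk = {}       # blocks that have really reached the disks
        self.ram_cache = {}  # acknowledged writes still waiting in RAM
        self.flash = {}      # persistent copy made during a power failure

    def write(self, block, data):
        # The write is acknowledged as soon as it is in the RAM cache.
        self.ram_cache[block] = data

    def flush(self):
        self.disk.update(self.ram_cache)
        self.ram_cache.clear()

    def power_fail(self):
        # Capacitors keep the card alive just long enough to copy RAM to flash.
        if self.flash_backup:
            self.flash = dict(self.ram_cache)
        self.ram_cache.clear()  # RAM contents are lost either way

    def power_on(self):
        # Restore any saved cache and complete the outstanding writes.
        self.ram_cache.update(self.flash)
        self.flash.clear()
        self.flush()

def survive_outage(flash_backup):
    ctrl = CachedController(flash_backup)
    ctrl.write(7, "file system journal entry")
    ctrl.power_fail()
    ctrl.power_on()
    return ctrl.disk

with_backup = survive_outage(flash_backup=True)
without_backup = survive_outage(flash_backup=False)
```

In the model, the write that was acknowledged but not yet flushed reaches the disk only when the backup is present. Without it the operating system believes the data was written when it never was, which is how a file system ends up hosed.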

Hierarchical Storage Management

To make better use of the Intel 910 I decided to experiment with an HSM system called MoonWalk. The idea is that I create a 'source' directory on the RAID 0 Flash Drive and a 'destination' directory on the slower RAID 5 Disk file system. The HSM software will automatically migrate files from the source directory to the destination directory, and leave behind empty stub files in the source directory. If any program tries to access the stub files in the source directory, the HSM software will automatically de-migrate the files back from the destination.

In effect, you can pretend that your source directory is a lot bigger than it really is because files that are not used frequently are migrated to the slower, larger, less expensive disk array. It is convenient because when you want to access the files in the source directory again, they are automatically restored.
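The migrate and de-migrate cycle can be sketched in a few lines. This is not how MoonWalk works internally (a real HSM product hooks into the file system so that recall is transparent to programs); here a zero-byte file stands in for the stub, recall happens through an explicit `read_file` helper, and the directory and file names are made up.

```python
import shutil
import tempfile
from pathlib import Path

def migrate(path, destination):
    """Move a file to slow storage and leave an empty stub behind."""
    shutil.move(str(path), str(destination / path.name))
    path.touch()  # zero-byte stub keeps the name visible in the source

def read_file(path, destination):
    """Read a file, recalling it first if only a stub remains."""
    archived = destination / path.name
    if path.stat().st_size == 0 and archived.exists():
        path.unlink()  # remove the stub before bringing the data back
        shutil.move(str(archived), str(path))
    return path.read_bytes()

with tempfile.TemporaryDirectory() as root:
    source = Path(root) / "flash"       # small, fast storage
    destination = Path(root) / "array"  # large, slow storage
    source.mkdir()
    destination.mkdir()

    track = source / "track01.wav"
    track.write_bytes(b"audio data")

    migrate(track, destination)
    stub_size = track.stat().st_size    # the stub takes no space on flash

    contents = read_file(track, destination)
    recalled_size = track.stat().st_size
```

A real system would also need a policy for deciding which files to migrate, typically based on how recently each file was accessed and how full the fast storage is.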

Thumb Drive Boot

I was finally able to do something I have wanted for years - boot Windows directly from a USB Thumb Drive. For a long time it was possible to boot Windows PE or Windows RE from a Thumb Drive for installing and/or repairing Windows, but I could never figure out how to actually install the full Windows O/S on the Thumb Drive and boot it, until recently. Finally I found a great article on how to do this with Windows 8 and I was able to get it working.

The great thing about this is that in an emergency, if I have a disk system failure, or if I just want to do maintenance, I can boot the full version of Windows with full functionality from the Thumb Drive.

Wednesday, August 4, 2010

Limping Back to Normality

Well the last month or so has been incredibly stressful with my RAID system problems. Things are looking better now, but I'm not out of the woods.

Last weekend I updated the firmware on my motherboard. Among the things updated was the firmware for the RAID controller. I tried rebuilding the disk on SATA port 5, and yet again it failed after only 12% complete. Using the RAID management software I took the disk offline so I could start the rebuild process again. However, when I put the disk online again it did not start rebuilding; it simply displayed that the disk was online and there were no errors.

A few hours later everything seemed to be normal so I restarted my system just to make sure this was not a fluke. The system came up and everything seemed normal, except the RAID software showed that there was a background initialization process going on with Virtual Disk 0. According to Intel technical support this means it's scanning the disk for errors and correcting them. After a few hours it finished initializing Virtual Disk 0 and started on Virtual Disk 1.

The next day Virtual Disk 1 was still initializing and only 0% complete. However I noticed many messages in the log file "corrected media error..." Also, the Media Error counter for SATA 5 had changed from 254 to 250 so things seem to be getting better.

According to Intel technical support when I upgraded the firmware it probably corrected some corrupted data in the RAID controller that was causing SATA port 5 to fail. I hope that's the case because it's better knowing you actually did something to fix a problem than the problem mysteriously going away.

I don't know how long it will take to finish initializing Virtual Disk 1, but while it's going on my system is running slow, with lots of annoying pauses. This is really frustrating when I spent so much effort to build a system that would be fast.

Friday, July 9, 2010

RAID is for Disk Bugs

Last month I went on vacation to Calgary, and I shut down Gemini for the time. Everything seemed to be working fine when I left, so when I got back from vacation and turned Gemini on again I was surprised to find the system was running really slowly.

On a hunch I suspected there was a disk failure and my RAID 5 system was running degraded. I managed to find the RAID Management Software from Intel and installed it. Sure enough, it confirmed that the array was degraded while it was trying to rebuild disk 5. I came back hours later to find that the rebuild had not progressed past 3%. The RAID software was reporting there were 254 Unrecoverable Medium errors on that drive so I figure maybe there was something seriously wrong with it.

I replaced disk 5 with a spare disk and started up the system, and sure enough it started rebuilding this disk. Things seemed to be progressing better, but still it was awesomely slow rebuilding. After several days it was only 10% complete. Then I started getting more of those Unrecoverable Medium Errors, and not only on the newly replaced disk 5: disk 1 was also starting to report the same errors.

After a week had gone by the rebuild was at 81%, and the progress seemed to be going better so I was hopeful it would finish the next day. Much to my dismay the next day I discovered the rebuild had failed because there were no more entries in the bad-block table.

Storage on a disk drive is divided up into 512 byte units called sectors. Every drive has some spare sectors in case there are unrecoverable errors. In this case the sector with the error is remapped to one of the spare sectors via the bad-block table. Anyway, I ordered a new disk drive to replace the failed drive.
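Here is a toy model of that remapping, to show what running out of table entries looks like. The spare count, sector numbers, and class are invented for illustration; real drives and controllers have far larger tables and handle this in firmware.

```python
class Disk:
    """Toy disk with a fixed pool of spare sectors."""

    def __init__(self, spare_sectors):
        self.free_spares = list(range(spare_sectors))
        self.bad_block_table = {}  # failed sector -> spare sector

    def report_media_error(self, sector):
        """Remap a failed sector to a spare, if any spares are left."""
        if sector in self.bad_block_table:
            return self.bad_block_table[sector]
        if not self.free_spares:
            raise RuntimeError("no more entries in the bad-block table")
        spare = self.free_spares.pop(0)
        self.bad_block_table[sector] = spare
        return spare

disk = Disk(spare_sectors=2)
disk.report_media_error(1042)  # first bad sector goes to spare 0
disk.report_media_error(9001)  # second bad sector goes to spare 1

try:
    disk.report_media_error(5555)  # third bad sector has nowhere to go
    table_full = False
except RuntimeError:
    table_full = True
```

Once the spares are used up, the next unrecoverable error cannot be hidden any more, which is the point at which a rebuild like mine gives up.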

So now it looks like I have two failed disks, and one disk that is slowly failing. If another drive fails I will be royally hooped. In a RAID 5 system, if one disk fails the system will still run. But if two disks fail the whole file system is destroyed.

Saturday, May 15, 2010

Coming Home

All the worrying was not misplaced, some things went well, but others did not.

I finally got Gemini home a couple of weeks ago. CoolIT Systems had her in the shop for about two or three weeks before they finally got to her. They worked on her for a few days and then I got a phone call at work.

The good news was they replaced the whole cooling system with a new one. They no longer make the WS240, so they replaced it with a couple of ECO units. This is a cooling pump on each CPU, and they are both routed to a 2 x 240 radiator the same as before.

The bad news is that they pulled the motherboard without calling me first. Even though I had said several times in e-mail "don't pull the motherboard without calling me" they went ahead anyway and did it. The problem is they just disconnected all the cables without thinking about how they were going to connect them back. In retrospect I should have taped a note to the side of the chassis -- don't pull the motherboard without calling me.

Most of the cables I did not care about, but there were five cables that connect the disk array to the motherboard - those had to be in the right order. At first I was just in shock: how would I ever get my disk array working again? I had spent six and a half days ripping my music collection of CDs to my disk, and I did not want to have to do that again.

Now you may be asking why I didn't back up my disk. Well, because it's a RAID 5 array and it's supposed to be reliable. Any one disk can fail and the array will continue to operate. You can replace the failed disk and the array will rebuild itself. Also, I do back up the boot partition, but my music collection is not on the boot partition because the CDs themselves are the backup.

When I got to thinking straight I realized that I had connected the drives in a specific order, and there were really only 4 different combinations I would have chosen. I explained this to CoolIT and they said they would try to get it running again. A couple of days later they said they had it running again and I was able to breathe a deep sigh of relief. They packed up the system and shipped it back to me, and it got to me Tuesday, May 4.

When I got Gemini out of the box and onto the dining room table I took off the side panel to inspect the job. The cooling system looked OK, but I was surprised how they rewired everything. The fan wiring did not make sense to me, but it must have worked at one point because they said they tested the system. Also, two of the chassis fans were not connected to anything. The audio cable from the front of the chassis was connected to a fan header - I don't know what they were thinking - and a few cables were not connected to anything because they could not figure out where the connections went.

It did not take much effort to fix all the connections and I powered up the system a couple of times just to make sure all the fans were running. Then I put the side panel back on and took Gemini upstairs back to her nest in my office. She's the biggest thing in the nest mind you.

When I started her up fully for the first time there were three strange beeps after the Power On Self Test (POST) but then Windows started booting and everything seemed OK. However, Windows was unusually sluggish for some reason. I then rebooted Gemini from the Intel Deployment CD to run the RAID utilities. It appeared that they had not actually connected the disks back in the right order, but it seems the RAID system didn't really care, it just figured out which disk was which and carried on. It seems that it had to spend a little time rebuilding things, which may explain why the system performance was slow for a while, but after a while everything was fine again.

A few days later I was curious about the new three beeps every time I started the system so I went back to the Intel manual. Evidently they were telling me there was a memory error. I checked the system properties and sure enough the system could only see 10 GB of the 12 GB of RAM I had installed. After closer inspection of the motherboard I could see a tiny LED that was on in a row of 6 LEDs. According to the Intel manual that indicated which memory module was at fault. I pulled the module and reseated it carefully, powered on the system, and all was well again.

Since then Gemini has not been giving me any problems - consequently I'm in a much better frame of mind - and I feel like I've learned a bunch of new things I did not know before.

I don't think CoolIT were incompetent or anything. I knew ahead of time they were busy and even though they told me they did not do this kind of custom work any more, they took responsibility to address a product they sold me that was failing. I deliberately did not put any pressure on them to hurry, but after the third inquiry in three weeks I think they panicked a little to get going and were just too enthusiastic about getting things done. To their credit they also took responsibility to resolve the problems they had created. I would certainly do business with them again.

A happy little side note: I ran the Windows System Rating utility and for some reason my Graphics is running faster than before: 7.5 instead of 7.1. I think this is probably because I updated the ATI drivers right before I sent Gemini away for repair.

Friday, April 9, 2010

Worrying While the Children Are Away

Last weekend I sent Gemini back to CoolIT Systems in Calgary to have the water pump on the WS240 replaced.

Overall it was quite a stressful exercise because I sent the system back completely assembled. The previous time I sent the case back for correcting the problems with the WS240 configuration, it was just the case and cooling systems. This time the motherboard, processors, disks, graphics card, etc are all there. The system was in working condition when I sent it - I really hope it comes back that way.

Why worry? Because there is a lot that can go wrong. I don't exactly know what is involved in replacing the water pump, only that I'm not really equipped or qualified to do it myself. The WS240 is a self contained unit and it was not really intended that the customer be able to replace parts or modify it.

One problem is that if they try to replace the water pump in the case, cooling fluid could leak into the system and damage something. The water pump sits on top of one of the CPUs.

If they have to remove the water block the pump sits on from the CPU, then there is the issue of reattaching it. I used some pretty special thermal paste for bonding the CPU to the water blocks and it requires special preparation.

If they have to remove the entire WS240 then they will have to pull the motherboard. If they pull the motherboard there are all kinds of connections that have to be put back in the right place. In particular the SATA cables to the disk drives all have to go back to the right disk drive or the RAID configuration will be toast.

I suppose one reason people worry about their children so much is that they have invested so much in raising them. It's a little bit like that for me too. I conceived Gemini: researched all the system components, architected a system design, built the system, went through the birthing process, multiple times in fact as Gemini has been reborn several times. Gemini is a problem child to say the least - I've seen so many things go wrong that it's all too easy for me to imagine so many other things that can go wrong.

So it's a little like sending your child on a two week trip to a special hospital to treat a special disease. All I can do is worry and hope that all turns out well in the end.