Wednesday, August 4, 2010

Limping Back to Normality

Well the last month or so has been incredibly stressful with my RAID system problems. Things are looking better now, but I'm not out of the woods.

Last weekend I updated the firmware on my motherboard. Among the things updated was the firmware for the RAID controller. I tried rebuilding the disk on SATA port 5, and yet again it failed after only 12% complete. Using the RAID management software I took the disk offline so I could start the rebuild process again. However, when I put the disk online again it did not start rebuilding; it simply displayed that the disk was online and there were no errors.

A few hours later everything seemed to be normal so I restarted my system just to make sure this was not a fluke. The system came up and everything seemed normal, except the RAID software showed that there was a background initialization process going on with Virtual Disk 0. According to Intel technical support this means it's scanning the disk for errors and correcting them. After a few hours it finished initializing Virtual Disk 0 and started on Virtual Disk 1.

The next day Virtual Disk 1 was still initializing and only 0% complete. However, I noticed many "corrected media error..." messages in the log file. Also, the Media Error counter for SATA 5 had changed from 254 to 250, so things seem to be getting better.

According to Intel technical support when I upgraded the firmware it probably corrected some corrupted data in the RAID controller that was causing SATA port 5 to fail. I hope that's the case because it's better knowing you actually did something to fix a problem than the problem mysteriously going away.

I don't know how long it will take to finish initializing Virtual Disk 1, but while it's going on my system is running slow, with lots of annoying pauses. This is really frustrating when I spent so much effort to build a system that would be fast.

Friday, July 9, 2010

RAID is for Disk Bugs

Last month I went on vacation to Calgary, and I shut down Gemini for the time. Everything seemed to be working fine when I left, so when I got back from vacation and turned Gemini on again I was surprised to find the system was running really slowly.

On a hunch I suspected there was a disk failure and my RAID 5 system was running degraded. I managed to find the RAID Management Software from Intel and installed it. Sure enough, it confirmed that the array was degraded while it was trying to rebuild disk 5. I came back hours later to find that the rebuild had not progressed past 3%. The RAID software was reporting there were 254 Unrecoverable Medium errors on that drive, so I figured maybe there was something seriously wrong with it.

I replaced disk 5 with a spare disk and started up the system, and sure enough it started rebuilding this disk. Things seemed to be progressing better, but still it was awesomely slow rebuilding. After several days it was only 10% complete. Then I started getting more of those Unrecoverable Medium Errors, and not only on the newly replaced disk 5: disk 1 was also starting to report the same errors.

After a week had gone by the rebuild was at 81%, and the progress seemed to be going better so I was hopeful it would finish the next day. Much to my dismay the next day I discovered the rebuild had failed because there were no more entries in the bad-block table.

Storage on a disk drive is divided up into 512 byte units called sectors. Every drive has some spare sectors in case there are unrecoverable errors. In this case the sector with the error is remapped to one of the spare sectors via the bad-block table. Anyway, I ordered a new disk drive to replace the failed drive.
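For anyone curious how that remapping works, here is a toy sketch in Python. It is only an illustration of the idea described above, not how Intel's RAID firmware or any real drive implements it; the class name `SpareSectorPool` and the sector numbers are made up for the example. It shows why a rebuild can fail outright once the table runs out of entries.

```python
class SpareSectorPool:
    """Toy model of a bad-block table backed by a fixed pool of spare sectors."""

    def __init__(self, spare_sectors):
        self.free_spares = list(spare_sectors)  # spare sectors not yet used
        self.remap = {}  # bad sector number -> spare sector number

    def mark_bad(self, sector):
        """Remap a failing sector to a spare; fail when no spares are left."""
        if sector in self.remap:
            return self.remap[sector]  # already remapped earlier
        if not self.free_spares:
            raise RuntimeError("no more entries in the bad-block table")
        spare = self.free_spares.pop(0)
        self.remap[sector] = spare
        return spare

    def resolve(self, sector):
        """Return the sector actually used when `sector` is read or written."""
        return self.remap.get(sector, sector)
```

With only two spares, the third bad sector has nowhere to go, which is roughly the situation the rebuild ran into.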

So now it looks like I have two failed disks, and one disk that is slowly failing. If another drive fails I will be royally hooped. In a RAID 5 system, if one disk fails the system will still run. But if two disks fail the whole file system is destroyed.
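The reason one failure is survivable and two are not comes down to parity. Below is a minimal Python sketch of the XOR parity idea behind RAID 5, assuming a single stripe with four data blocks and one parity block. Real controllers rotate parity across the disks and work on much larger blocks, so the function names and block contents here are purely illustrative.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(surviving_blocks):
    """Recover the single missing block of a stripe from all the others."""
    return xor_blocks(surviving_blocks)


# One stripe: four data blocks plus the parity block computed from them.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(data)
stripe = data + [parity]

# Pretend the third disk died; XOR of the four survivors gives its block back.
survivors = stripe[:2] + stripe[3:]
rebuilt = rebuild_missing(survivors)
```

If two blocks of the stripe are missing, XOR of the remaining three only yields the two lost blocks mixed together, and there is no way to separate them again.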

Saturday, May 15, 2010

Coming Home

All the worrying was not misplaced: some things went well, but others did not.

I finally got Gemini home a couple of weeks ago. CoolIT Systems had her in the shop for about two or three weeks before they finally got to her. They worked on her for a few days and then I got a phone call at work.

The good news was they replaced the whole cooling system with a new one. They no longer make the WS240, so they replaced it with a couple of ECO units. This is a cooling pump on each CPU, and they are both routed to a 2 x 240 radiator the same as before.

The bad news is that they pulled the motherboard without calling me first. Even though I had said several times in e-mail "don't pull the motherboard without calling me" they went ahead anyway and did it. The problem is they just disconnected all the cables without thinking about how they were going to connect them back. In retrospect I should have taped a note to the side of the chassis -- don't pull the motherboard without calling me.

Most of the cables I did not care about, but there were five cables that connect the disk array to the motherboard - those had to be in the right order. At first I was just in shock: how would I ever get my disk array working again? I had spent six and a half days ripping my music collection of CDs to my disk, and I did not want to have to do that again.

Now you may be asking why I didn't back up my disk. Well, because it's a RAID 5 array and it's supposed to be reliable. Any one disk can fail and the array will continue to operate. You can replace the failed disk and the array will rebuild itself. Also, I do back up the boot partition, but my music collection is not on the boot partition because the CDs themselves are the backup.

When I got to thinking straight I realized that I had connected the drives in a specific order, and there were really only 4 different combinations I would have chosen. I explained this to CoolIT and they said they would try to get it running again. A couple of days later they said they had it running again and I was able to breathe a deep sigh of relief. They packed up the system and shipped it back to me, and it got to me Tuesday, May 4.

When I got Gemini out of the box and onto the dining room table I took off the side panel to inspect the job. The cooling system looked OK, but I was surprised how they rewired everything. The fan wiring did not make sense to me, but it must have worked at one point because they said they tested the system. Also, two of the chassis fans were not connected to anything. The audio cable from the front of the chassis was connected to a fan header - I don't know what they were thinking - and a few cables were not connected to anything because they could not figure out where the connections went.

It did not take much effort to fix all the connections and I powered up the system a couple of times just to make sure all the fans were running. Then I put the side panel back on and took Gemini upstairs back to her nest in my office. She's the biggest thing in the nest mind you.

When I started her up fully for the first time there were three strange beeps after the Power On Self Test (POST) but then Windows started booting and everything seemed OK. However, Windows was unusually sluggish for some reason. I then rebooted Gemini from the Intel Deployment CD to run the RAID utilities. It appeared that they had not actually connected the disks back in the right order, but it seems the RAID system didn't really care; it just figured out which disk was which and carried on. It seems that it had to spend a little time rebuilding things, which may explain why the system performance was slow for a while, but after a while everything was fine again.

A few days later I was curious about the new three beeps every time I started the system so I went back to the Intel manual. Evidently they were telling me there was a memory error. I checked the system properties and sure enough the system could only see 10 GB of the 12 GB of RAM I had installed. After closer inspection of the motherboard I could see a tiny LED that was on in a row of 6 LEDs. According to the Intel manual that indicated which memory module was at fault. I pulled the module and reseated it carefully, powered on the system, and all was well again.

Since then Gemini has not been giving me any problems - consequently I'm in a much better frame of mind - and I feel like I've learned a bunch of new things I did not know before.

I don't think CoolIT were incompetent or anything. I knew ahead of time they were busy and even though they told me they did not do this kind of custom work any more, they took responsibility to address a product they sold me that was failing. I deliberately did not put any pressure on them to hurry, but after the third inquiry in three weeks I think they panicked a little to get going and were just too enthusiastic about getting things done. To their credit they also took responsibility to resolve the problems they had created. I would certainly do business with them again.

A happy little side note: I ran the Windows System Rating utility and for some reason my Graphics score is higher than before: 7.5 instead of 7.1. I think this is probably because I updated the ATI drivers right before I sent Gemini away for repair.

Friday, April 9, 2010

Worrying While the Children Are Away

Last weekend I sent Gemini back to CoolIT Systems in Calgary to have the water pump on the WS240 replaced.

Overall it was quite a stressful exercise because I sent the system back completely assembled. The previous time I sent the case back for correcting the problems with the WS240 configuration, it was just the case and cooling systems. This time the motherboard, processors, disks, graphics card, etc. are all there. The system was in working condition when I sent it - I really hope it comes back that way.

Why worry? Because there is a lot that can go wrong. I don't exactly know what is involved in replacing the water pump, only that I'm not really equipped or qualified to do it myself. The WS240 is a self-contained unit and it was not really intended that the customer be able to replace parts or modify it.

One problem is that if they try to replace the water pump in the case, cooling fluid could leak into the system and damage something. The water pump sits on top of one of the CPUs.

If they have to remove the water block the pump sits on from the CPU, then there is the issue of reattaching it. I used some pretty special thermal paste for bonding the CPU to the water blocks, and it requires special preparation.

If they have to remove the entire WS240 then they will have to pull the motherboard. If they pull the motherboard there are all kinds of connections that have to be put back in the right place. In particular the SATA cables to the disk drives all have to go back to the right disk drive or the RAID configuration will be toast.

I suppose one reason people worry about their children so much is that they have invested so much in raising them. It's a little bit like that for me too. I conceived Gemini: researched all the system components, architected a system design, built the system, and went through the birthing process, multiple times in fact, as Gemini has been reborn several times. Gemini is a problem child to say the least - I've seen so many things go wrong that it's all too easy for me to imagine so many other things that can go wrong.

So it's a little like sending your child on a two week trip to a special hospital to treat a special disease. All I can do is worry and hope that all turns out well in the end.

Thursday, March 4, 2010

Plumbing Problems

“If I had my life to live over again, I’d be a plumber.”
- Albert Einstein

Just when you think all your troubles are behind you... A couple of weeks ago Gemini started making sort of an intermittent ticking sound. I wasn't sure what it was, I thought maybe one of the disk drives. The problem seemed to be getting worse, but sometimes it would go away, so I ignored it for a while.

This week the sound is now constant, and does not let up - it's a ticking, buzzing sound that is really annoying. At first I thought it was a fan so I checked the inside of my computer. There seemed to be a lot of dust collecting in some of my fans and the radiator for my cooling system. I got out the vacuum cleaner and got rid of as much dust as possible, but that didn't solve the problem. Then I tried stopping the various fans to see which one it could be, but again the sound would not go away.

Finally I disconnected the power plug for the pump in my cooling system and the sound went away. The good news is I know the source of the problem, the bad news is that this is a really difficult thing to fix. It will likely result in me having to completely dismantle Gemini, ship it back to Calgary for them to replace the pump, ship it back to me, and I will have to completely rebuild Gemini from square one. That's an enormous amount of really stressful work.

I'm so not looking forward to this as it also means I will be without Gemini for two weeks or more.

Friday, February 26, 2010

Smoother Sailing

Well for the first time in many months I have nothing in particular to rant about - Gemini has been mostly behaving. Hopefully most of the bad weather is behind me now, and I am finally beginning to enjoy having a top-of-the-line computer more than worrying about it.

This month I took some time to do some recreational programming. Yah, some of us do it for fun. I'm trying to learn a new programming language called Scala. Java has been my favorite programming language for over 15 years now, so now I'm wondering what else is out there to try, will there ever be anything better than Java?

Java was such a revolutionary change from what we had before that it has now become the most popular programming language in the world. It has had such an impact on technology that James Gosling (the creator of Java) was made an Officer of the Order of Canada. However, Java is 15 years old and the language is not evolving as fast as other languages (like C#). It is incredibly hard to add new features to the language (like closures) because the keepers of the language design are very conservative. Also there is the whole Java Community Process that governs changes to the language and the platform - which is like the biggest committee in the world.

A few years ago a new language came on the scene - Scala. It was developed as a research project to combine functional and object oriented programming paradigms, as well as to try out a lot of new ideas in programming language design. My first look at it was pretty scary - I could not understand it, and the programs were hard to read. I attended a presentation on Scala while I was at one of the JavaOne conventions - but I left still confused about the language - so much of it was over my head.

Not wanting to be too old a dog to learn new tricks I kept trying to understand Scala until I finally found a couple of good books on the language. Everything else I had been reading always seemed over my head. Finally things started making sense enough for me to try writing some simple programs in Scala, and the more I played with it the more I began to like Scala.

One of the reasons I built the computer I did - 2 processors - 8 cores - 16 threads - was so I could experiment with concurrent programming -- that is, programs that are able to make use of all those cores and threads. I took one of the simpler programs I love to play with, prime numbers, and decided I would write a version that could make use of all those threads.

One of the cool things about Scala is that it is designed to handle concurrent programming well. This gave me a great chance to play around with two new things at once - concurrent programming and Scala. I had a few problems along the way, but after a few weeks of trial and error I finally got my program working - which is able to use 90% of the power of my computer. All in all it was a very satisfactory experience, reminding me very much of what first attracted me to computers when I was 12 years old - trying to figure out how to solve a problem using a computer.
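To give a flavour of what such a program looks like, here is a minimal sketch of the general idea: split the range of numbers into chunks and count the primes in each chunk on a pool of workers. This is not the Scala program described above (its design is not shown here); it is written in Python, and the names `count_primes`, `count_primes_in`, and the `executor_cls` parameter are hypothetical. One caveat: the default `ThreadPoolExecutor` shows the structure, but standard CPython threads do not run CPU-bound work in parallel, so passing `ProcessPoolExecutor` is the assumed way to actually keep many cores busy.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial division by odd numbers up to the square root of n."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    i = 3
    while i * i <= n:
        if n % i == 0:
            return False
        i += 2
    return True


def count_primes_in(bounds):
    """Count primes in the half-open range [start, stop)."""
    start, stop = bounds
    return sum(1 for n in range(start, stop) if is_prime(n))


def count_primes(limit, workers=16, executor_cls=ThreadPoolExecutor):
    """Count primes below `limit` by splitting the range across a worker pool."""
    step = max(1, limit // workers)
    chunks = [(lo, min(lo + step, limit)) for lo in range(0, limit, step)]
    with executor_cls(max_workers=workers) as pool:
        # Each chunk is counted independently; the partial counts are summed.
        return sum(pool.map(count_primes_in, chunks))
```

Equal-sized chunks are the simplest split, but the chunks with larger numbers take longer to test, so a real program would likely use more, smaller chunks to keep every worker busy.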

In the future I hope to find the time for more 'recreational programming' - time to just play and experiment with new things and ideas. It's a real thrill doing things that you just can't do on any old computer.

Sunday, January 31, 2010

If At First You Don't Succeed...

The Definition of Insanity: doing the same thing over and over again and expecting different results. -- Albert Einstein

When dealing with Microsoft products the definition of insanity is: doing the same thing over and over again and expecting the same results.

I thought I was being clever when I bought software to change my partition sizes so I would not have to reinstall Windows 7 yet again... But a few days after resizing my C: partition and the hidden System Reserved partition I noticed that the system could not recognize my optical drive any more. I tried several times uninstalling the device driver and reinstalling it, but every time I tried to reinstall it Windows reported that the driver was corrupted. Interestingly enough I was able to boot from the optical drive because I used the Windows 7 Setup DVD to try to repair the operating system - unfortunately that did not fix the problem.

Then I also noticed that my printers stopped working and I could not seem to fix them. I was also having problems with the print spooler and had to keep restarting it.

Finally I tried restoring the system image from my previous backup. That seemed to fix the printer problem, but not the optical drive problem. I figured that during the resizing of my partitions there was probably some file system damage - given that it took over 25 hours of solid disk activity to perform the operation. There seemed nothing left to do but to do a clean install of Windows 7, yet again.

Unfortunately I had invited my sisters and nephews over to finally see Gemini - but I had to cancel because so many things were not working right, and I needed a full weekend to work on getting Gemini working again.

Overall the installation went fairly smoothly - no real surprises. I was very methodical, taking notes at every step. When I got past all the Windows updates I started installing drivers and applications. The first application I installed was my BitDefender security software because I wanted to make sure that while I was installing the rest of my applications I was not introducing any malware into the system.

Next I started attaching devices to Gemini. I have gotten into the habit of disconnecting all my devices (USB drive, printers, TV tuner, etc.) because it makes the update process faster and more stable. Each time I did something I kept checking the state of the Optical Drive - and it seemed OK.

I bought three computer games after Christmas: Dragon Age Origins, Mass Effect, and East India Company. All three seemed to install properly, but Mass Effect and East India Company would not run because of some error that told me to wait for a future update. Interestingly enough a week later Windows automatically downloaded an updated version of Mass Effect and that seemed to fix that problem. I still haven't tried East India Company again.

A week later I've got most things installed again and restored my personal files. So far nothing seems to be persistently wrong - aside from the usual Windows networking that never works right. In the case of Windows it's netbroken. Anyway, I'm really hoping this is the last time I have to install Windows 7 again. I've gotten really good at it, but it usually takes a whole weekend of my time - time I could better spend doing fun things.

Cheers, Eric

Tuesday, January 19, 2010

A Fresh Start

The man who sets out to carry a cat by its tail learns something that will always be useful and which never will grow dim or doubtful. -Mark Twain

I thought I had everything sorted out properly on Gemini and reinstalling Windows was now something that was behind me - but boy was I ever wrong.

An ongoing issue I've been having with Windows 7 is that the search function just doesn't work properly. Apple and Google have figured out how to do simple effective search, but this technology, or even understanding the issue, is far beyond Microsoft.

I found a forum discussion on Microsoft's site where people were complaining that search did not work properly in Windows 7. Unfortunately someone had closed the discussion, so I created a new discussion with the same title. It was very popular as people still had a lot of complaints on this issue. People had many suggestions on things to try, and I tried many of them but none ever solved the simple case I was trying to get working.

Recently someone posted a procedure he claimed solved the problem. Part of the procedure involved changing ownership of all the files and folders on my C: drive from the system to my account. I knew this sounded dangerous and I should have followed my gut and stopped, but I didn't. After completing the procedure it still did not fix my search problem, but worse, it destabilized my entire system. Many of my device drivers started failing, and I was not able to reinstall any of them. My graphics driver got totally hosed so I removed it, and could not reinstall it. As a result I could not run any 3D graphics any more.

I tried restoring my system to an earlier point before the trouble started. Windows restore takes a long long time to work - stupidly long - and it didn't fix the problem.

Eventually I figured the issue was so bad the only thing left to do was reinstall Windows. The problem was I had moved everything over from my old computer to Gemini, and installed many new applications and got them configured. I could think of no other solution than making a clean start.

Last Saturday I bit the bullet. I backed up all my user data to my external drive (used for backups) and went to reinstall Windows. I've done this so many times before that you would think this part would be easy - not! One of the things I was trying to do was create my C: drive with a 64 KB cluster size. This makes the file system blocks bigger than the standard 4 KB ones so that it's faster to access. It wastes more space, but I have so much space I don't really care. Unfortunately Microsoft does not make this process easy as it's not a standard setup. One thing I had to be careful of was that my System Reserved partition was at least 200 MB so that backup would work.
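The space trade-off is easy to put numbers on. Here is a small Python sketch, assuming the simple model that every file occupies a whole number of clusters (real NTFS has exceptions, such as very small files stored inside the file table, which this ignores). The function names are made up for the example.

```python
def allocated_bytes(file_size, cluster_size):
    """Disk space a file occupies when rounded up to whole clusters."""
    clusters = -(-file_size // cluster_size)  # ceiling division
    return clusters * cluster_size


def slack_bytes(file_size, cluster_size):
    """Space allocated to the file but not actually used by its contents."""
    return allocated_bytes(file_size, cluster_size) - file_size


KB = 1024
small_file = 10_000        # a 10,000 byte file
large_file = 1024 * KB     # a 1 MB file

waste_4k = slack_bytes(small_file, 4 * KB)     # 2,288 bytes wasted
waste_64k = slack_bytes(small_file, 64 * KB)   # 55,536 bytes wasted
clusters_4k = allocated_bytes(large_file, 4 * KB) // (4 * KB)     # 256 clusters
clusters_64k = allocated_bytes(large_file, 64 * KB) // (64 * KB)  # 16 clusters
```

So a small file wastes far more space with 64 KB clusters, while a large file needs far fewer clusters to keep track of, which is where the hoped-for speed comes from.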

It took me 3 tries at installing Windows to finally get things right and I spent the rest of the day restoring all the user files I had backed up, and installing my applications over again. Unfortunately when I went to back up my system the backup failed again. Somehow in all the confusion of installing Windows 3 times my System Reserved partition got set back to 100 MB - this is a bug in the Windows Setup by the way.

I didn't want to reinstall Windows yet another time so I thought I would just change the partition sizes. Unfortunately while Windows has tools to change the partition sizes, it does not have the tools to change them the way I needed. I checked the Norton site to see if I could get a copy of Partition Magic, but they don't offer that product any more. Then I downloaded a demo copy of Acronis Disk Director (because I've heard so many good things about it). It appeared to do exactly what I wanted, but the demo version would not complete the operation. Next I paid for the complete version, but it would not install because Windows 7 is not supported. You would have thought the demo version would have told me that - sheesh. Finally I found a product called EASEUS Partition Master that worked under Windows 7.

Yesterday morning at 5:00 AM I changed the partition sizes to what I needed. This operation has to be performed before Windows actually boots, so I rebooted my system to see what would happen. The process started and there was a nice little display showing the progress, but the progress was really slow, so I left it running and went to work. When I got home after work the process was still not finished; it was going incredibly slowly. At 6:00 AM this morning the process finally finished and my system rebooted. I checked the partition sizes, and they were finally set correctly. I tried doing a backup of Gemini and it worked this time.

The moral of the story is when people offer you advice in a forum on how to fix something on your system - take the advice with a great deal of caution - especially if it sounds like something dangerous.

Cheers, Eric