Sunday, 1 July 2012

When a RAID1 fails on your critical server

As I already said: hard disks fail. Even branded 15k rpm SCSI HDDs fail. Of course, in the worst way, on the worst day.

I was enjoying my Saturday afternoon at home with my 2-year-old baby when I looked at my "sysadmin" mobile (I keep one just for monitoring alarms and other notifications).
One message: not so bad. Today is "full backup" day, so the system is heavily loaded while updating the backup DB, and a warning about CPU usage is usual, but...

"*** PROBLEM Service Alert: xxxxx RAID status is CRITICAL"

xxxxx is my "main" server, which hosts most of the non-development services: DNS, ERP, DB server, wiki, administrative storage...
I connected immediately to my monitoring system via web browser to get more details about the failure. The RAID1 that failed is md0; CRITICAL means that one of its devices has gone offline.
Of course, md0 is the root device. Even worse, of the two HDDs that make up the failed RAID1, the one that failed also holds a partition used as swap.
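From a shell on the server itself you can see the same thing; something like this (md0 is my array, device names will differ on your system):

cat /proc/mdstat                 # a failed member is marked (F) and the status line shows [U_] instead of [UU]
sudo mdadm --detail /dev/md0     # reports the array as degraded and names the faulty device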
What does this mean? The filesystem is OK (the RAID1 is degraded but still functional), but processes go crazy because they cannot access their memory if it has been swapped out.
Of course the main processes (oracle, mysql, apache, java) had some pages swapped out, so they were blocked.
But Linux is resilient: I could still access the server via SSH and do some useful things to prevent further corruption. I killed some CPU-intensive processes, remounted all filesystems read-only and, finally, tried to reboot.
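Roughly, the damage-control session looked like this (just a sketch: PIDs and mount points are obviously specific to my situation):

sudo kill 12345                  # stop the heaviest CPU hogs (PID found via top; 12345 is just an example)
sudo mount -o remount,ro /srv    # remount data filesystems read-only to limit further damage
sync                             # flush whatever can still be flushed to disk
sudo reboot                      # ask for a clean reboot (this is the step that hung, see below)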
Well... reboot did not work. Even init had some pages in swap, and it couldn't do its job.
I had to go to the server room.
By pressing the power button I was able to turn off the failed server.

Luckily I had some used spare parts (the server is pretty old, new spare parts are hard to find and cost too much): two SCSI disks (larger than the failed one, fortunately) were perfect.
After considering replacing the failed hard disk (which, of course, also holds the GRUB MBR), I chose a different approach.
The failed hard disk is not completely broken: it just failed a few SCSI transactions (probably due to heavy swap usage) and the SCSI stack kicked it out.
In my view the disk could still be used, at least for booting and nothing more. So I added the spare HDD to an empty slot, turned on the server and crossed my fingers.
Everything booted fine! Yeah!
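A quick way to confirm that the kernel really sees the newly added disk (the /dev/sde name matches the commands below, but it is simply whatever your system assigns):

dmesg | grep -i scsi         # look for the newly attached SCSI disk in the kernel log
sudo fdisk -l /dev/sde       # confirm it shows up with the expected size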
After that I partitioned the spare HDD pretty much like the failed one, plus a bigger swap space (a sketch of this step follows the commands below), and then:

sudo swapon /dev/sde2 # the new swap partition on the spare disk
sudo swapoff /dev/sda2 # the swap partition on the failed disk
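For the record, the partitioning step can be done by copying the layout of the failed disk onto the spare and then formatting the new swap area; roughly like this (sda/sde are the names used above, and sfdisk options may differ between versions):

sudo sfdisk -d /dev/sda > sda.layout    # dump the partition table of the failed disk
sudo sfdisk /dev/sde < sda.layout       # replay it on the spare (then enlarge the swap partition with fdisk)
sudo mkswap /dev/sde2                   # format the new swap partition; needed before swapon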

A small change to /etc/fstab makes the new swap partition the one used at the next reboot.
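The relevant fstab change just points the swap entry at the new partition; roughly like this (old entry kept as a comment):

# /dev/sda2   none   swap   sw   0   0    <- old swap, on the failed disk
/dev/sde2     none   swap   sw   0   0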
I also added the spare disk to the RAID1 (the partition that mirrors the root filesystem, not the swap one):

sudo mdadm --manage /dev/md0 --add /dev/sde1 # add the spare disk's root partition to the degraded mirror

Now cat /proc/mdstat shows that the array is rebuilding, and the CRITICAL state in the monitoring turns into WARNING.
After half an hour of reconstruction, WARNING turns into OK.
For sure I'll have to do some maintenance on this server, but not before enjoying the rest of Saturday and the whole Sunday!!!

In the end I was lucky: my monitoring system worked well, my used spare parts were useful, and Linux is so well structured that a major issue like this was resolved before my pizza got cold. But a few things are worth remembering:

  1. always have some spare parts for your critical servers
    • hard disks of the same technology (SATA, SCSI, SAS), at least as big as the ones you are using but still compatible with your hw/sw stack
    • a power supply (especially if it's not ATX compatible)
    • RAM, even if it's less critical. Usually a server can keep working with a memory bank removed, but it's better to have some spare
    • the best option is a perfect clone of your working machine, turned off and ready to be powered on, or to be sacrificed to give spare parts to the working one. This usually cannot be done (due to budget limits), but it can be done easily for old servers with used hardware (e.g. from eBay). Be sure to heavily test used hardware you purchase before declaring it fit for spare parts!
  2. always have a monitoring system. It's better to know that something failed on Saturday evening, when nearly no one is working, than to find out on Monday morning after the first users notice that "there's something wrong"
  3. nearly everything, in hw and sw, should be redundant. It's not so useful to have RAID1 for the filesystem when a failure of the swap device hangs your server! The bootloader should be redundant too: if you have mirrored the boot device (which usually holds the root filesystem) you should also install the bootloader (e.g. GRUB) on both mirror devices (see the sketch after this list)
  4. be prepared, and periodically check your monitoring, your recovery procedures and your spare parts
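To make point 3 a bit more concrete, this is the kind of setup I mean; a sketch only, assuming two mirrored disks sda/sdb and a second partition on each (sda2/sdb2) reserved for swap, with md1 as a made-up name for the new swap array:

# install the bootloader on both members of the mirror, so the machine can boot from either disk
sudo grub-install /dev/sda
sudo grub-install /dev/sdb

# put swap itself on a small RAID1, so a single disk failure no longer freezes swapped-out processes
sudo mdadm --create /dev/md1 --level=1 --raid-devices=2 /dev/sda2 /dev/sdb2
sudo mkswap /dev/md1
sudo swapon /dev/md1
# remember to point the swap line in /etc/fstab at /dev/md1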