Monday, 3 December 2012

Don't let Ubuntu kill your laptop HDD!

A simple question: which "lifestyle" makes your (laptop) hard disk last longer?
  1. keep it running all of its life, no matter whether you need to read from it or not
  2. turn it off (well.. spin it down) as soon as possible, and turn it on again as needed
If you chose 2), well.. be prepared. You are silently killing your HDD!
Spin-up, in fact, is quite expensive for a hard drive and, by design, a drive is rated only for a limited number of spin-down/spin-up sequences.
So, if you don't strictly need to preserve your laptop battery, please leave the disk spinning!

Why can Ubuntu be an HDD serial killer?
Because, by default, it spins down your drive quite often. Open a terminal and try the following command line

 sudo smartctl -a /dev/sda | grep -i "Start_Stop_Count\|Load_Cycle_Count\|Power_On_Hours"
  4 Start_Stop_Count        0x0012   097   097   000    Old_age   Always       -       5238
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       719
193 Load_Cycle_Count        0x0012   095   095   000    Old_age   Always       -       55359


As you can see, on my 1-year-old laptop I have nearly 720 hours of work but 5238 (!!!) start/stop sequences and 55359 load cycles.
It's like I had turned on (or put in standby) my laptop about 14 times a day, every day (even Christmas), for the last year!
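
To put those counters in perspective, here is a minimal sketch that turns them into rates. The function name smart_rates and the 365-day default are my own assumptions for illustration; it only does arithmetic on smartctl-style attribute lines read from standard input, taking the raw value from the last field of each line.

```shell
# Hypothetical helper: reads `smartctl -A`-style attribute lines on stdin
# and prints per-hour and per-day rates. Usage: smart_rates [DAYS_OF_LIFE]
smart_rates() {
    LC_ALL=C awk -v days="${1:-365}" '
        $2 == "Start_Stop_Count" { starts = $NF }   # raw value is the last field
        $2 == "Power_On_Hours"   { hours  = $NF }
        $2 == "Load_Cycle_Count" { loads  = $NF }
        END {
            if (hours > 0) printf "load cycles per power-on hour: %.1f\n", loads / hours
            if (days > 0)  printf "start/stop per day: %.1f\n", starts / days
        }'
}

# Sample input: the three attribute lines from the output above
smart_rates 365 <<'EOF'
  4 Start_Stop_Count        0x0012   097   097   000    Old_age   Always       -       5238
  9 Power_On_Hours          0x0012   099   099   000    Old_age   Always       -       719
193 Load_Cycle_Count        0x0012   095   095   000    Old_age   Always       -       55359
EOF
```

With my numbers this works out to roughly 77 load cycles for every hour the disk was powered on. On a real system the same function can be fed with `sudo smartctl -A /dev/sda | smart_rates 365`.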

How to change this?
I've found this solution, which is also useful for managing your battery consciously
  • install laptop-mode-tools
sudo apt-get install laptop-mode-tools
  • edit /etc/laptop-mode/laptop-mode.conf with your preferred editor and look for the BATT_HD_POWERMGMT setting. If it's 1, be prepared for early disk death!
  • change it to a reasonable value (like 100; 254 is the highest, while 255 disables power management altogether). The -B switch section of man hdparm will help you understand this value: note that values from 1 to 127 still permit spin-down, while values from 128 to 254 do not
  • save the file and type
sudo service laptop-mode reload

This will apply the changes.
Just to be sure, type

sudo hdparm -B /dev/sda

This shows the value currently applied.
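
If you prefer not to edit the file by hand, the change can be scripted. This is only a sketch under my own assumptions: the function name set_batt_apm is made up, it expects the setting to be written as a plain BATT_HD_POWERMGMT=... line, and it relies on GNU sed (the one shipped with Ubuntu).

```shell
# Hypothetical helper: set BATT_HD_POWERMGMT in a laptop-mode.conf-style file.
# Usage: set_batt_apm FILE VALUE   (VALUE is an hdparm -B level, 1..255)
set_batt_apm() {
    local file=$1 value=$2
    # Accept only plain decimal numbers
    case $value in
        ''|*[!0-9]*) echo "value must be a number" >&2; return 1 ;;
    esac
    if [ "$value" -lt 1 ] || [ "$value" -gt 255 ]; then
        echo "value must be between 1 and 255" >&2
        return 1
    fi
    # Keep a backup copy, then rewrite only the matching assignment line
    cp -- "$file" "$file.bak" &&
    sed -i "s/^BATT_HD_POWERMGMT=.*/BATT_HD_POWERMGMT=$value/" "$file"
}

# Example (needs root for the real file, so it is left commented out):
#   set_batt_apm /etc/laptop-mode/laptop-mode.conf 254
```

After changing the real file, reload laptop-mode and check the result with hdparm -B as described above.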

Long life to your laptop's HDD!

Thursday, 16 August 2012

Synology NAS Disaster Recovery


A good sysadmin should do good work, especially when preparing for the worst.

One of the things that new sysadmins usually understand only after it has already happened is Disaster Recovery.

Sooner or later, no matter how well you work or how much you pay for your hardware, something will fail or go wrong.
It may be a disk or a power supply, and that's not so bad: if you are anything better than a young student you always use at least RAID1 and redundant power supplies. Just hot/cold plug your spare part (because.. you have spare parts right under your desk, don't you?) and you're done, back to your desk.

What if something worse happens? Don't think about fire or earthquakes: if you have only one site and everything burns or falls down, well.. no one will work anyway, even if you can restore your mail server in just half an hour!

Let's say that you have a NAS where you store all your server data.
What if its motherboard fails?
Well.. I know that it's quite rare for a CPU or chipset to die nowadays, but it can happen.
In that case all your redundant power supplies and RAID5+hot spare disks become inaccessible..

The easy and costly solution is to have two of them: move the disks from one to the other, restore the configuration and you're done: 10 min of downtime!
But it costs too much for something that will probably never happen in 10 years.

I faced this problem a few days ago, when I decided to replace most of my storage with a centralised NAS. A SAN costs too much for my business, so my hardware vendor suggested a good NAS: a Synology RackStation RS2212RP+.
That's a 10 SATA bay NAS, with dual Gigabit Ethernet with aggregation, redundant power supply and some really cool additional features.
One of the coolest is that... it's Linux based!!!
What can you do with a Linux-based business-line NAS without a hardware RAID controller, in case of disaster recovery?
Just substitute it with a standard Linux PC until you receive the new NAS!!

It's pretty easy, if you know at least a bit of Linux administration via command line, to replace this NAS (at least for basic features). It will have a $0 impact on your (always too small) IT budget (for sure you have a spare PC with 2-3 SATA ports somewhere.. don't throw it away too soon!!) and you can recover your data in less than half an hour (if you are prepared!)

In my setup I have:
  1. 4 HDDs configured as RAID5 + hot spare. This way 2 disks are enough to have a working RAID
  2. On top of this RAID5 I built a disk group, which allows me to create more than one iSCSI LUN and/or more than one NFS/Samba share
  3. I also created 1 LUN (and 1 iSCSI target) plus 1 volume shared via NFS
This is quite a complex setup and, probably, the most complex you can build with this Synology NAS. Less complex configurations (RAID1, or a single volume instead of a disk group) will require less work.
This is what I did when simulating a really bad hardware failure of my brand new NAS:
  • plug 2 of the NAS disks (not the hot spare, of course!) into an empty Linux box. I used a standard Dell desktop, without any disk.
  • boot a live distro from an Ubuntu 12.04 LTS Desktop USB dongle
  • once the boot is complete, install some more packages, which are not installed by default in the desktop edition:
sudo apt-get install mdadm lvm2

That's because I'll need to work with software RAID and LVM.

  • Now, a bit of scanning to find the RAID device

sudo mdadm --assemble --scan

The above command will scan the physical disks to find an already created array. It works automatically. With just 2 disks the array is degraded but the data is accessible.
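
As a quick sanity check after assembling, the kernel's /proc/mdstat shows how many members each array has. Here is a minimal sketch, assuming the usual `[total/active]` counter on the status line; the function name md_members and the sample text (array name, partition names, block count) are made up for illustration, they are not taken from my NAS.

```shell
# Hypothetical helper: reads /proc/mdstat-style text on stdin and prints,
# for each array, how many of its member devices are active.
md_members() {
    awk '
        /^md[0-9]+ :/ { name = $1 }                # remember the current array
        match($0, /\[[0-9]+\/[0-9]+\]/) {          # the [total/active] counter
            split(substr($0, RSTART + 1, RLENGTH - 2), n, "/")
            printf "%s: %d of %d devices active\n", name, n[2], n[1]
        }'
}

# Made-up sample of a 3-disk RAID5 assembled from only 2 disks
md_members <<'EOF'
md2 : active raid5 sda3[0] sdb3[1]
      3897385984 blocks super 1.2 level 5, 64k chunk, algorithm 2 [3/2] [UU_]
EOF
```

On the rescue box itself the input would simply be `md_members < /proc/mdstat`.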

  • It's time for LVM! Fortunately the Synology guys didn't create something weird and custom, but used all the power of Linux to provide the flexibility required by their "Disk Group" feature. Scanning for LVM is much like what I've done for RAID
  1. sudo pvscan
  2. sudo vgscan
  3. sudo vgchange -a y 
  4. sudo lvscan
Which
  1. scans for physical volumes (PV)
  2. scans the PVs to look for volume groups (VG). This operation takes a bit of time, but no more than a minute on my 2TB disks
  3. activates the VGs found
  4. scans the VGs to look for logical volumes (LV)
LVs are the end of the LVM stuff; if you are new to LVM, let's say that an LV is like a standard disk partition.
If you have just created "volumes" (in Synology terms) to export via NFS/Samba, you're done. Just mount the LV and access your data!

sudo mount /dev/vg1/nfs_share /mnt/nfs
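
The LVM steps and the final mount can be chained in a small script, so that you don't have to remember them when the disaster is real. This is a sketch, not what I actually ran: the run and recover_lv names and the DRY_RUN switch are my own inventions, and the read-only mount option is an extra precaution that is not part of the steps above. By default it only prints the commands.

```shell
# Sketch only: with DRY_RUN=1 (the default here) commands are printed, not run.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Usage: recover_lv LOGICAL_VOLUME MOUNTPOINT
recover_lv() {
    local lv=$1 mountpoint=$2
    run sudo pvscan &&            # look for physical volumes
    run sudo vgscan &&            # look for volume groups
    run sudo vgchange -a y &&     # activate the volume groups found
    run sudo lvscan &&            # list the logical volumes
    run sudo mkdir -p "$mountpoint" &&
    # -o ro: read-only mount, an assumption added for safety on a degraded array
    run sudo mount -o ro "$lv" "$mountpoint"
}

recover_lv /dev/vg1/nfs_share /mnt/nfs
```

Run it once as is to review the commands, then again with DRY_RUN=0 to really execute them.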

If you want to access iSCSI data you have two choices:
  1. configure an iSCSI target on your Linux box to export the LV and mount it on a client. This takes a bit longer and it's outside the scope of this article. Take a look, for example, here for a really good tutorial.
  2. mount the partition locally.
You cannot, AFAIK, mount the LV directly if it has been used as an iSCSI target. In fact the client (initiator, in iSCSI terms) sees the target as a disk, so it will, at least, build a partition table on it. So, before mounting, you must scan for a partition table and create the right block devices. Nowadays this is pretty easy:

sudo kpartx -a -v /dev/vg1/iscsi_0

The above command will scan the LV and, if a partition table is found, create the corresponding devices. In my simple test-bed I had only one partition, so I could just mount it to have my files back on-line

sudo mount /dev/mapper/iscsi_0p1 /mnt/iscsi

That's all folks!
Now, order a new Synology NAS and replace the unlucky one!


Sunday, 1 July 2012

When a RAID1 fails on your critical server

As I already said: hard disks fail. Even branded 15k rpm SCSI HDDs fail. Of course in the worst way, on the worst day.

I was enjoying my Saturday afternoon at home with my 2-year-old baby when I looked at my "sysadmin" mobile (I have one just for monitoring alarms and other notifications).
One message: not so bad, today is the "full backup" day, so the system will be heavily loaded when updating the backup DB; a warning about CPU usage is usual, but..

"*** PROBLEM Service Alert: xxxxx RAID status is CRITICAL"

xxxxx is my "main" server, which runs most of the non-development services: DNS, ERP, DB server, wiki, administrative storage..
I immediately connected to my monitoring system via web browser to learn more details about the failure. The RAID1 that failed is md0.. CRITICAL means that one of its devices has gone offline.
Of course, md0 is the root device.. even worse, of the two HDDs that make up my failed RAID1, the one that failed also has a partition that's used as swap.
What does this mean? The filesystem is OK (the RAID1 is degraded but functional) but processes go crazy because they cannot access their memory if it has been swapped out.
Of course the main processes (oracle, mysql, apache, java) have some pages swapped out, so they are locked.
But Linux is strong and I could access the server via SSH and do some useful things to prevent more system corruption. I killed some CPU-intensive processes, remounted all filesystems read-only and, finally, tried to reboot.
Well.. reboot did not work. Even init had some pages on swap, and it couldn't do its work.
I had to go into the server room.
By pressing the power button I was able to turn off the failed server.

Luckily I had some used spare parts (the server is pretty old, new spare parts are hard to find and cost too much): two SCSI disks (larger than the one that failed, fortunately) were perfect.
After thinking about replacing the failed hard disk (which, of course, also holds the GRUB MBR) I chose a different way.
The failed hard disk was not completely broken: it just failed a few SCSI transactions (probably due to heavy swap usage) and the SCSI stack kicked it out.
In my opinion I could still use the disk, at least for booting and nothing more. So I added the spare HDD to an empty slot, turned on the server and crossed my fingers.
Everything booted fine! yeah!
After that I partitioned the spare HDD pretty much like the failed one, plus a bigger swap space, and:

sudo swapon /dev/sde2 #the spare part
sudo swapoff /dev/sda2 #the failed one

A small change to /etc/fstab makes the new swap setting persistent across reboots.
I also added the spare disk to the RAID1:

sudo mdadm --manage /dev/md0 --add /dev/sde2

Now cat /proc/mdstat says that the array is rebuilding, and the CRITICAL state has become WARNING.
After half an hour of reconstruction, WARNING turned into OK.
For sure I'll have to do some maintenance on this server, but not before enjoying the rest of Saturday and the whole Sunday!!!
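
The CRITICAL, WARNING and OK states come from my monitoring system. I'm not showing its real check here, but a minimal sketch of such a check could look like the following. The function name check_md is made up, and the exit codes are an assumption based on the common Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL): a degraded array that is rebuilding is a WARNING, a degraded array that is not rebuilding is CRITICAL.

```shell
# Hypothetical check: reads /proc/mdstat-style text on stdin.
# Exit code: 0 = OK, 1 = WARNING (degraded but rebuilding), 2 = CRITICAL.
check_md() {
    awk '
        /^md[0-9]+ :/          { name = $1 }              # current array
        /\[[U_]*_[U_]*\]/      { degraded[name] = 1 }     # e.g. [_U]: a member is missing
        /(recovery|resync) *=/ { rebuilding[name] = 1 }   # progress line of a rebuild
        END {
            status = 0
            for (n in degraded) {
                if (!(n in rebuilding)) status = 2
                else if (status < 1)    status = 1
            }
            split("OK WARNING CRITICAL", label, " ")
            print "RAID status is " label[status + 1]
            exit status
        }'
}

# Typical use on the server itself (left commented out here):
#   check_md < /proc/mdstat
```

The point of the middle state is that a rebuild in progress still deserves attention, but not a trip to the server room.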

In the end I was lucky: my monitoring system worked well, my used spare parts were useful and Linux is so well structured that a major issue like this was resolved before my pizza got cold. But a few things have to be remembered:

  1. always have some spare parts for your critical servers
    • hard disks of the same technology (SATA, SCSI, SAS), at least as big as the ones you are using but still compatible with your hw/sw stack
    • power supply (especially if it's not ATX compatible)
    • RAM, even if it's not so critical. Usually a server can work without some memory banks, but it's better to have spares
    • the best is to have a perfect clone of your working machine, turned off and ready to be turned on or sacrificed to give spare parts to the working one. This usually cannot be done (due to budget limits) but can be done easily for old servers with used hardware (e.g. from eBay). Be sure to heavily test the used hardware you purchase before declaring that it can be used for spare parts!
  2. always have a monitoring system. It's better to know that something failed on Saturday evening, when nearly no one is working, than to find out on Monday morning after the first users notice that "there's something wrong"
  3. nearly everything, in hw and sw, should be redundant. It's not so good to have RAID1 for the file system when a failure of the swap device hangs your server! The bootloader should be redundant too: if you have mirrored the boot device (which usually holds the root file system) you should also install the bootloader (e.g. GRUB) on both mirror devices
  4. be prepared, and periodically check your monitoring, recovery systems and spare parts

Sunday, 24 June 2012

How to re-install a WinXP downgrade from Seven (in 2012)

Hard disks fail; sometimes they just "die" without any warning. Yesterday everything was fine and today, when you turn on your workstation, it simply does not boot.
Fortunately, this time the user was smart and recognized that something was wrong: he reported to me that his workstation was pretty slow, taking longer to boot every day. This slowness seemed to be caused by heavy disk activity. Well.. WinXP (like all Windows versions) has the strange habit of becoming slower as time passes.. this is OK if you're a human getting close to 60, but a bit weird (IMHO) if you are a 2-3 year old PC.
Having smartmontools on my Sysadmin Swiss Knife USB key, I just ran "smartctl -t short /dev/sda" (this is a WinXP machine, but smartmontools is available for Windows too and works pretty well there). The tool reported an error on a sector: usually this means that the drive will fail soon.
Fortunately (again) I had bought a spare SATA disk just a few days before: after speaking with the user and his manager, we decided to re-install the whole system on the brand new HDD ASAP.

Reinstalling WinXP should be pretty easy.. or not? Well.. this is a Win7 Pro machine, one of the first we bought when WinXP was EOL and OEMs did not sell WinXP anymore (does Canonical say that you cannot install Ubuntu 8.04 anymore????? I hate these policies on commercial software..)
We chose to downgrade it to WinXP Pro, because ALL our machines run WinXP, so why should I waste my sysadmin time managing two Windows versions if I can manage only one? Why should my users have to learn another OS if they can work with a more comfortable one?

Let's go back to the install issue.. I could not find the original CD provided with the PC, so I took another WinXP SP2 disc and installed from that.. When the installer asked for the Product Key I entered the one printed on the PC chassis label.. it was wrong? Damn..
Looking a bit closer at the Product Key label, it says that it's a Win7 product.. Damn (again)
Finding a solution to a problem in a closed source product is harder than doing the same with FOSS.
After a bit of googling I just turned to the old-fashioned "phone call to the one who sold it". The answer was something like this: "don't worry, enter the product key of another WinXP instance".
mmmhhh..  "doesn't this break the activation of the "other instance"?"
answer: "well.. it may, but usually it does not." - in my experience this is not true.. but some of you may try this at home
me: "any other solution?"
answer: "enter the product key of the other instance but do not activate on-line. Choose to activate by phone" - in 2012?!?!?!?!?!!? arggggggggghhhhhhh!!! - "a recorded voice will ask you to enter the code displayed on screen. Do not enter anything: the voice will repeat a few times and then will pass you to a (human) operator. Tell the operator that you have an old instance of Win7 to downgrade to WinXP" - old because you cannot downgrade anymore, sic - "he will ask you for the Win7 product key and should generate a new key to activate WinXP correctly"

Damn.. this will take a lifetime.. I bought the license, I paid for it and now I have to waste a lot of time on the activation process.. BTW Oracle does not do anything like that, even with its database licenses. They just kick your ass if they catch you, but let you work without wasting time..
I went through the whole process and entered a long number (nearly 40 digits, I think) at the end of the activation process.

So, if you have to re-install a Win7-to-WinXP downgrade, proceed as follows:

  1. install from any WinXP CD and enter any valid Product Key you have
  2. when activating the product DON'T activate online, but do it by phone
  3. when the recorded voice asks you to enter the number shown on screen, do NOT enter it. Even if recorded, the voice will, sooner or later, get bored of your inability to press phone digits and will redirect you to a human operator
  4. the human operator is, of course, human, so you can speak with him/her and explain what you're trying to do
  5. if he/she says that a downgrade is not possible in 2012, tell him/her that you're a sysadmin re-installing an old workstation: it HAS to be WinXP due to policy restrictions at your site. This should be enough to go ahead with the activation process
  6. tell him/her your Win7 Product Key (which is a bit hard on the phone.. but possible!)
  7. after a bit of work he/she will generate a new product key for WinXP and redirect you to the recorded voice (again!), which this time will just tell you the very-long-number to enter on-screen for activation
  8. you're done. Enjoy your eXPerience with licensed products!!!!

I'm a Linux sysadmin but, of course, I have to work with Windows too.
Windows' lack of a decent default command line shell is already a big problem for me.
The lack of a decent default REMOTE command line shell is a bigger problem for me.
The lack of a default package manager, capable of installing/upgrading/updating not only the OS but also all the installed software, is another problem for me.
But.. managing licenses and activation codes is even worse. I bought the OS license (you have to: only a few OEMs give you the option of buying a workstation without an OS!!) and now I cannot use it because MS just says that the product is EOL?

Cut the product support (which, BTW, I've never used), cut the bug/security hotfixes (which I do install, but which are meaningless without a good antivirus) but, please, let me easily install software that I have already paid for!!!

Thursday, 26 April 2012

Ubuntu 12.04 on Dell Vostro 3550

Ubuntu 12.04 is out now, perfectly in time with the arrival of my brand new Dell Vostro 3550. Pretty happy with its Win7 Pro performance, I was just waiting for the latest Ubuntu release (which is, fortunately, an LTS too!) to use Linux as the primary OS and relegate Windows to a VM (I'll try to run the original Win7 as a DomU with Xen soon!)

The download was really fast (I wonder how many mirrors Canonical and the community set up for this release!) and the ISO was soon "burned" onto a USB dongle with the Startup Disk Creator of the Ubuntu installed on my nettop (I had no luck with unetbootin under Windows).
Now it's time to see if 12.04 works with my laptop. I plugged the USB dongle into a free port and turned on the system..
Instead of the standard Ubuntu screen, after the Dell BIOS splash, I got only a black screen with a "Machine Check Error" message!
Bad.. really bad..
The first thing to do is, of course, to google for the error. Someone says it's due to UEFI, someone says it's due to a recent BIOS update (on other hardware, however), someone says...
wait..
what? after plugging the USB dongle into Win7 (and thus installing the driver) the problem disappears?? naaaaa... it's impossible! why should installing a driver in Win7 solve this kind of problem?
Well.. I didn't find a better solution so I tried this one.. and it works!!!!

I don't really know why (maybe Win7 "marks" the USB dongle somehow and the BIOS is happy with that..) but what really matters is that it works..
well no, what really matters is that a USB dongle does NOT work out of the box but, I think due to BIOS restrictions, requires the intervention of Win7.. bad, bad news

I hope that in the next few days I'll find out "why" and "how" it works

Tuesday, 17 April 2012

Telecom Italia and the DoS (dis)service for the (business) customer


For about a month now, a brave technician from mamma Telecom has kept coming to our company to take away the Cisco 877 router we have on loan, because the loan has been terminated (while the "wonderful" adaptive ADSL stays in place).
The first time (even though I'm the IT administrator and manager) I am caught completely off guard: after a quick round of questions (to the owner and the secretary) I find out that nobody ever asked to terminate the contract (I should hope so!), let alone just the rental.
The good technician believes it is a mistake and does not leave us without a router. I immediately set about restoring the old Cisco 857 I had in a drawer, to replace the 877. A round of phone calls to 191 (the business service line, no kid stuff) solves nothing: nobody knows what is going on or who asked for the termination.

Yesterday the brave young man comes back aboard his red Panda Van and, with the backup already running, I hand him the router without batting an eyelid. Downtime: zero. I pat myself on the back and go back to my "real" work.

While the server compiles the EZSDK for me, I pick up the phone again: the first call to 191 is useless. I try again (call centers should always be called 2-3 times to get a decent picture of the problem) and talk to another guy.
He confirms for the umpteenth time that this is not his office's responsibility but, lo and behold!, he opens a "fault" ticket so that the right office takes charge of it: they will call me back by 10 the next morning (yeah, right...). He asks for my first and last name and a mobile number (luckily I give him my personal one and not the owner's company one) and adds:

"mind you, you will get a text message with the ticket details [...]. Since "the computer" is not working (so it's a known issue, ed.) you will get a few extra messages. They are all copies of the same one, so feel free to delete them and keep only the first ones"

mmmmmhhhh.... this doesn't look good... "oh well", I think, "I'll get, what, 10-20 messages? who cares..."

BipBip.. shortly after, here comes the first "thank you for contacting [...] your report about the number 0434.... has the following ID ......" (or something like that)

bipbip, bipbip, bipbip, bipbip (20-second pause) bipbip, bipbip, bipbip, bipbip..
10 new messages, my good phone tells me..
better turn off the ringer..

Now the good reader should stop for a moment and try to guess exactly HOW MANY text messages I received. I don't think you'll get close..

In short, within about an hour, I received a whopping 180 (one hundred and eighty!!!!) text messages, all with the same text. My poor phone, besides being unusable (I actually missed a couple of calls, which went straight to voicemail), also drained almost all of its battery.

I don't know about you, but to me this looked like a nice DoS attack (oops.. service, I'm a business customer after all!)

Sunday, 15 April 2012

Why open a blog?
Well.. to have a place to write down ideas that maybe someone else is interested in reading.
I was looking for a place to write about my technical research and the weird things that happen in my work and private life as an IT expert.
I could write such things on a social network but I don't think it's really what I want (web search engines will not index them and it's hard to have a "professional" layout).

Apart from IT there are lots of things that happen in my life.. most of them are private and will stay so, but sometimes you feel that you have a nice idea and think: "why not share this with the rest of the world?".

I'm Italian so I need another note for my Italian readers..

Why write in English? Well, first of all, I won't necessarily write everything in English! Then because, in my head, I'm mainly planning to write about IT "stuff" (or anyway technology-related things), and these things are most often useful (and interesting) to an international audience and, not least, they come out better when written in English (I don't know about you, but I can't remember the last time I read a document/book/datasheet in a language other than English).

Most likely I will choose English for technical posts and Italian for everything else: also because I'm used to reading English, but not to writing it, so the effort is no joke (though the result might be!)