Wednesday, June 6, 2012

Remote system upgrade (with grub and bmc-watchdog)

IPMI is a very powerful tool for system administrators, especially those who telecommute. Its serial over LAN (SOL) support eliminates the need to sit in front of a server for any pre-network operation, including reconfiguring the BIOS settings. However, it does require either (A) an additional IP address to reach the IPMI network interface from the Internet or, when no additional IP can be allocated, (B) access to a second server on the same LAN (not necessarily with administrator privileges). When either (A) or (B) is available, you can theoretically do anything remotely, including a fresh installation of an operating system (starting, for example, with a network boot and/or a remote drive).

Unfortunately, one of my recent situations allowed neither (A) nor (B), so the first installation had to be done by on-site personnel. But once a networked system was up and running with a working grub boot manager, I could remotely install a new system on an unused partition (or a large enough swap partition) and test it with the "boot once" support of grub. On a Debian-based system with grub-2, this involves:
  • changing the value of "GRUB_DEFAULT" in /etc/default/grub to "saved",
  • running "update-grub",
  • editing /boot/grub/grub.cfg to make an entry for the new system (if it was not discovered correctly by grub-probe),
  • running "grub-reboot" for the entry, and
  • rebooting the machine.
However, you are likely to make some mistake in the new system and lose network contact with the server until an on-site person can hit the reset button of the machine for you.
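The grub steps listed above can be sketched as a small script. This is only a dry-run illustration, not a tested installer: the `run` helper, the `DRY_RUN` switch, and the `ENTRY` title are hypothetical names introduced here, and the `sed` expression assumes /etc/default/grub already contains a `GRUB_DEFAULT=` line. By default it only prints what it would do.

```shell
#!/bin/sh
# Dry-run sketch of the "boot once" procedure.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to execute them as root.
DRY_RUN="${DRY_RUN:-1}"
# Hypothetical menu entry title; use the title (or number) of the new system's entry.
ENTRY="${ENTRY:-New system (test)}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

boot_once() {
    # 1. Make grub boot the saved entry by default.
    run sed -i 's/^GRUB_DEFAULT=.*/GRUB_DEFAULT=saved/' /etc/default/grub
    # 2. Regenerate /boot/grub/grub.cfg.
    run update-grub
    # 3. Manual step, not scripted here: add an entry for the new system
    #    to /boot/grub/grub.cfg if it was not discovered automatically.
    # 4. Select the new entry for the next boot only.
    run grub-reboot "$ENTRY"
    # 5. Reboot into the new system.
    run reboot
}

boot_once
```

Because the choice made with grub-reboot applies to a single boot, any later reset brings the machine back to the original default entry.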

Luckily for me, the BMC of the IPMI on the server did have a working watchdog timer. Therefore, I could set up the timer with enough time and start it before rebooting the machine. That way, if the new system worked, I could log in to the server through the Internet and stop the timer. But if the new system got stuck, the watchdog would do a hard reset on the machine after the time ran out and return it to the original working system... no more waiting for on-site personnel. The actual command I used to set up the timer is bmc-watchdog from freeipmi:
  • bmc-watchdog -s -u 4 -p 0 -a 1 -F -P -L -S -O -i 900
One can consult the man page for the meaning of these options. In short, this puts 15 minutes (900 seconds) on the timer for a hard reset, which can be checked with
  • bmc-watchdog -g
started with
  • bmc-watchdog -r
and stopped with
  • bmc-watchdog -y
(While, theoretically, one can achieve the same result with ipmitool, it did not work for me on the specific system.)
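The watchdog commands above could be wrapped into two small helpers, one to arm the timer before the reboot and one to disarm it after a successful login. This is a minimal sketch that reuses exactly the bmc-watchdog invocations shown above; the `run` helper, the `DRY_RUN` switch, the `TIMEOUT` variable, and the function names are hypothetical. By default it only prints the commands.

```shell
#!/bin/sh
# Dry-run sketch of the watchdog safety net.
# DRY_RUN=1 (the default) only prints the commands; set DRY_RUN=0 to execute them as root.
DRY_RUN="${DRY_RUN:-1}"
# Countdown in seconds; 900 seconds = 15 minutes, as in the post.
TIMEOUT="${TIMEOUT:-900}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Call this on the old system, just before rebooting into the new one.
arm_watchdog() {
    # Set up the timer for a hard reset after TIMEOUT seconds.
    run bmc-watchdog -s -u 4 -p 0 -a 1 -F -P -L -S -O -i "$TIMEOUT"
    # Check the settings.
    run bmc-watchdog -g
    # Start the countdown.
    run bmc-watchdog -r
}

# Call this after logging in to the new system, before the countdown expires.
disarm_watchdog() {
    run bmc-watchdog -y
}

arm_watchdog
```

If the new system never comes up, nobody calls `disarm_watchdog`, so the BMC resets the machine once the countdown expires.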
