Wow did I ever mess up.

You may have noticed mindstab.net was missing for the last 7 hours. The cause was a chain of events and not enough care on my part.

It started, I’m not quite sure when, sometime in the last few months, when I upgraded Kvasir’s baselayout from 1.9 to 1.11. Then I forgot about it.

Yesterday I noticed that Wildfire, a Java Jabber server, had been un-hardmasked. I thought I’d give it a try. How could a Java Jabber server take mindstab.net down, you ask? Wait. So I tried to install it, but unfortunately the install failed: version 3.1.0 installed, but not 3.1.1. I filed a bug. Today I got some advice mentioning it might have something to do with my old hardened kernel, 2.6.11. The suggestion was to upgrade to the recent stable 2.6.17, because some /proc access issues had been resolved there and that might be related to my problem. So I installed the new kernel during the day but thought I’d wait till I got home before rebooting; if there was a problem, I could then at least just reboot with the old kernel.

So I got home, and I had a few minutes before I had to go to work, so I rebooted. The new kernel worked fine; however, as soon as init took over, everything went straight to hell. Init complained udev wasn’t installed (remember, new baselayout), and I’m also pretty sure that the newer kernels don’t provide the static /dev filesystem anymore. So now init couldn’t find any harddrives. Crash. The end. Time to go to work. Booting the old kernel wouldn’t make a difference because it was init’s problem.

Ouch. Stupid on my part. Very stupid.

So I went to work, came home, and prepared to address the problem. I opened up Kvasir. Kvasir is the smallest 1U rackmount server not much money could buy. The case was designed to hold one harddrive and a CD drive/floppy drive combo. I had removed the CD drive/floppy drive combo and put a second harddrive in so I could have a RAID 1 setup (mirrored harddrives). Great, but now I had to unhook one of the harddrives and put in a CD drive, in cramped space. Got it done though. However, I then had some trouble with my LiveCDs. My full 2006.1 didn’t boot for some reason. 2005.1 booted, but “humorously” when I gave it the -noX option, it still scrambled the root password and so wouldn’t let me log on. “Awesome”. I had another 2005 LiveCD, but its RAID support seemed sketchy.

I finally just downloaded the 2006.1 minimal ISO and burned it on Nika. That worked and booted, but there were no md* entries in /dev. I did some googling and found three pages that, combined, provided what I needed. First I had to make sure the modules I wanted were loaded.

# modprobe dm-mod
# modprobe raid1

Then to get the lay of the land and remember my setup I ran

# mdadm -E --scan 

I then remembered the basic layout of my RAID setup (mdN = hda(N+1) + hdc(N+1)). Next I had to manually create the /dev/md* devices, and in my case I had to force their creation because one of the two mirror harddrives was “missing” (unplugged).

# mdadm -A /dev/md0 /dev/hda1 --run
# mdadm -A /dev/md2 /dev/hda3 --run
# mdadm -A /dev/md4 /dev/hda5 --run
# mdadm -A /dev/md5 /dev/hda6 --run

The --run forces the mdadm command to ‘A’ssemble the array even with missing harddrives. This created the /dev/md* devices I needed to safely access my harddrive without screwing up the larger RAID setup.
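Since the layout follows one rule (mdN = hda(N+1) + hdc(N+1)), the assemble commands can be generated instead of typed out. This is only a rough sketch, assuming that rule holds for every array; md_member and assemble_cmds are names I made up for illustration, and the script prints the commands rather than running them.

```shell
#!/bin/sh
# Hypothetical helpers; they assume the layout mdN = hda(N+1) + hdc(N+1).

# Print the partition on the given disk that backs array mdN.
md_member() {
    disk=$1   # disk name, e.g. hda
    n=$2      # array number, e.g. 2 for /dev/md2
    echo "/dev/${disk}$((n + 1))"
}

# Print (not run) the degraded-assemble command for each array number given.
assemble_cmds() {
    for n in "$@"; do
        echo "mdadm -A /dev/md${n} $(md_member hda "$n") --run"
    done
}

# The four arrays assembled above.
assemble_cmds 0 2 4 5
```

Printing first and pasting afterwards seemed the safer choice here, since a typo in a device name is the last thing you want when one half of the mirror is already unplugged.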

Now that I finally had access to my harddrive it wasn’t so hard to fix the problem. I chrooted into it, installed udev, and then rebooted with fingers crossed. And it worked! Still, there were a few other boot-up errors: the new kernel was giving iptables some guff, courier IMAP wasn’t starting, and the conf.d/net syntax was wrong. But there were more pressing matters first.
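For the record, the chroot-and-install step looks roughly like the sketch below. Treat it as a sketch under assumptions, not a transcript of what I typed: rescue_steps is a made-up helper that only prints the steps, the /mnt/gentoo mount point is just the usual convention, and which array holds the root filesystem is something you pass in (/dev/md2 in the usage line is a placeholder). If /usr or /var live on other arrays, those need mounting under the same tree as well.

```shell
#!/bin/sh
# Sketch only: prints the repair steps instead of running them.
# The root device and mount point are assumptions supplied by the caller.
rescue_steps() {
    root_dev=$1             # array holding the root filesystem (assumed)
    mnt=${2:-/mnt/gentoo}   # conventional mount point (assumed)
    cat <<EOF
mount ${root_dev} ${mnt}
mount -t proc none ${mnt}/proc
mount -o bind /dev ${mnt}/dev
cp -L /etc/resolv.conf ${mnt}/etc/resolv.conf
chroot ${mnt} emerge udev
EOF
}

# Usage (placeholder device): rescue_steps /dev/md2
```

The resolv.conf copy is there so name resolution works inside the chroot while the package is fetched.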

I powered down, unhooked the CD-ROM drive, plugged the second harddrive back in, and powered up again. Boot went as fine as before, but a

# cat /proc/mdstat

revealed that the second harddrive wasn’t being used yet (probably because it was out of sync and the system wasn’t sure what I wanted it to do with it).

Personalities : [raid1]
md2 : active raid1 hda3[0]
   10008384 blocks [2/1] [U_]

md4 : active raid1 hda5[0]
   20000032 blocks [2/1] [U_]
...

So, to get them re-enabled as mirrors, it turned out all I had to do was re-add them and the system would slowly re-mirror them.

# mdadm /dev/md2 -a /dev/hdc3
# mdadm /dev/md4 -a /dev/hdc5
...
# cat /proc/mdstat
Personalities : [raid1]
md2 : active raid1 hdc3[1] hda3[0]
   10008384 blocks [2/2] [UU]

md4 : active raid1 hdc5[1] hda5[0]
   20000032 blocks [2/1] [U_]
   [====>....................] recovery = 32.2% (6200000/20000032) finish=21.3min speed=10588K/sec

md5 : active raid1 hdc6[1] hda6[0]
   80000052 blocks [2/1] [U_]
      resync=DELAYED
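Rather than eyeballing that output for [U_], the degraded arrays can be picked out mechanically. A small sketch, assuming the mdstat format shown here (an "mdN :" line followed by a status line ending in something like [UU] or [U_]); degraded_arrays is a hypothetical helper name, and it reads mdstat-formatted text on stdin.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every array whose member-status
# field (e.g. [U_]) contains "_", meaning a mirror half is missing or
# not yet back in sync.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/       { name = $1 }    # remember the current array
        /\[[U_]*_[U_]*\]/   { print name }   # status field with a gap
    '
}

# Usage: degraded_arrays < /proc/mdstat
```

The progress bar line is ignored because it contains characters other than U and _ between its brackets.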

With that started, I moved on to sorting out my other problems by upgrading a few other packages and recompiling my kernel with some new netlink stuff for iptables.

And so mindstab.net is now returning to full capacity, but what a nightmare that was. I’m sorry for any inconvenience this caused anyone. I will start enforcing a stricter personal policy about when I can do low-level server work that requires reboots. And now I terribly need sleep.

References