I’m in the process of moving all of my servers to a small OpenVZ cluster. I had hoped that this would be painless, but some of my servers are not behaving well. The web server appears to be working fine, but if I can’t figure out what the problem is I may have to take everything down for maintenance. Don’t expect the blog to be available for the next couple of days.
Turns out that most of my problems didn’t have anything to do with the move. They were just some old (and in one case very old) misconfigurations that didn’t take effect until I rebooted one of my servers; that, and I had forgotten just how much memory GitLab uses.
The cluster is almost complete. I only need to configure Heartbeat and it should be done. Unfortunately, moving my servers revealed some rather alarming deficiencies in how some of the older servers have been configured, how backups are handled, and especially in how they are monitored. One server (albeit not a very important one) had not been backed up for months, and I had not been alerted to this. This is not acceptable!
The cluster will have to wait until I have dealt with this.
I have learnt a few things.
First, backup systems will (in this case at least) only work if all clocks are in sync, and if one of the servers doesn’t have a CMOS battery then any NTP failures will break the backup system. If the server in question also doesn’t have a proper hard drive to store logs on, can’t run OSSEC, can’t store logs remotely using rsyslog, and doesn’t send system mail to an external mail server, then the first you will hear of this is when you need those backups.
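A check like the following would have caught the silent failure much earlier. It is only a minimal sketch, not what I actually run: the function names, the directory layout (one flat directory of backup files) and the two-day threshold are all assumptions made for illustration. It also assumes that a file’s modification time reflects when the backup was written, and that it runs on a machine whose own clock can be trusted.

```python
import os
import time


def newest_backup_age(backup_dir, now=None):
    """Return the age in seconds of the newest regular file in backup_dir.

    Returns None if the directory contains no regular files.
    (Hypothetical helper; assumes a flat directory of backup files.)
    """
    now = time.time() if now is None else now
    mtimes = [
        entry.stat().st_mtime
        for entry in os.scandir(backup_dir)
        if entry.is_file()
    ]
    if not mtimes:
        return None
    return now - max(mtimes)


def backup_is_stale(backup_dir, max_age_seconds=2 * 24 * 3600, now=None):
    """Return True if there is no backup at all, or the newest one is too old.

    The two-day default is an arbitrary assumption; pick whatever matches
    the backup schedule.
    """
    age = newest_backup_age(backup_dir, now)
    return age is None or age > max_age_seconds
```

Run from cron on the backup host, once per backed-up server, with the result mailed to an external address, this does not depend on the broken server being able to report its own problems.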
Second, a DRBD cluster is not worth the effort of managing properly for a couple of private servers. It was very easy to set up, but robust it was not. My first attempt at taking a node down ended with the whole cluster going down and refusing to synchronize. It was probably my fault (it was the middle of the night, so I’m pretty sure it must have been my fault in some way), but when the DRBD service started giving error messages and crashing, I decided it was time to try something else while I knew the data on one node was still good.
I did not want to be forced to restore my servers from backups. The reason I set up DRBD in the first place was that I’m unhappy with my backup system and needed a stop-gap I could use until I can set up something more robust. Perhaps I will try this again in the future, but for now I will fall back on simple snapshots.
So ends my latest adventure with needlessly complex solutions to simple problems: in disgraceful failure and the use of a more appropriate tool. Next on the list: soldering and configuration management systems.