Death by PEBKAC evaded by amazing ZFS snapshot CTRL+Zery

Tonight I was doing a little development work towards a telemetry system I’m building for a thing. Along the way I managed (like a 10/10 n00b) to delete a bunch of vital configs on my reverse proxy server that handles all my traffic. Thanks to the amazing ZFS snapshot function made easily available by FreeNAS, I was able to recover from this otherwise devastating fckup, super fast and without service disruption.

Death by PEBKAC

So I reverse proxy loads of personal things (including stuff that other people actually use) through nginx on a VM. This VM contains not only virtual host definitions, but also SSL keys for each of these services.

I’ve been gradually rolling out Ansible-driven config management, and I actually use it fairly widely now, but not yet for my nginx configs. Those still follow my previous model, which is that I check everything in to a private git repo.

Here’s how I kicked myself to the curb:

  1. I added my latest host configs to revision control, and bulk-added other stuff I was sure wasn’t already in there.
  2. I then decided I didn’t want to use this new host, so I deleted it and committed the deletion.
  3. THEN I changed my mind again and decided I wanted it back. So I ran git revert, which I’d never used before, at least not with git (the rough sequence is sketched below).

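This is a reconstruction, not a transcript: the file names are made up and the exact invocations probably differed a bit, but the shape of it was:

  # 1. add the new host config, plus a bulk add of stuff I thought was untracked
  git add sites-available/newhost.conf
  git add .
  git commit -m "Add newhost, plus stragglers"

  # 2. change my mind: delete the host and commit the deletion
  git rm sites-available/newhost.conf
  git commit -m "Remove newhost"

  # 3. change my mind again: undo the deletion by reverting the last commit
  git revert HEAD
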
I made two hilarious mistakes here that nearly cost me hours of rebuilding effort:

  1. When I added stuff I wasn’t sure was in revision control, I didn’t check that everything had actually been added. I’ve often wiped out over how git adds files; I find it counter-intuitive when I tell it to add a directory and it seems to just add the directory root. Seriously, would anyone ever actually want that? I guess probably yes; Git was made by superheroes, after all.
  2. I assumed that git revert would only touch files/dirs mentioned in revision control. HAH! Nope. It changes directory structures to exactly match what’s in revision control. This is totally reasonable to me, just not what I assumed was going to happen (see the sketch after this list).

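In hindsight, a couple of cheap checks would have caught both mistakes. These are generic git habits rather than what I actually typed at the time:

  # after an add: confirm what actually made it into the index
  git status
  git add -A    # stages new files, modifications and deletions in one go

  # before a revert: see exactly which paths the commit being reverted touched
  git show --stat HEAD

  # or stage the revert without committing, so it can be inspected first
  git revert --no-commit HEAD
  git diff --staged
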
So there I was, left with a functional server. nginx had not been restarted, so the running instance was still serving happily from the config and keys it had loaded at startup, but the files themselves were gone from disk. The moment I restarted nginx, I knew I’d be screwed and would have to go rekey a ton of SSL certs and remake a bunch of bland nginx reverse proxy configs.

Shaka.

ZFS snapshots ctrl+z the fail

I back my VMs with a decent little box running FreeNAS, so my VMs live on ZFS. I also had the good fortune of having already set up automatic ZFS snapshots (which I also replicate to another ZFS box, offsite). I’d never actually tried restoring from a ZFS snapshot before, so I wasn’t sure it would work the way I assumed. It totally does, and through this I was saved.

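The snapshot setup itself is nothing exotic: FreeNAS drives it from the GUI with a periodic snapshot task plus a replication task. Underneath, the ZFS operations amount to something like the following sketch; the pool, dataset and snapshot names here are made up.

  # take a recursive snapshot of the dataset holding the VMs
  zfs snapshot -r tank/vms@auto-20160101.0300

  # first replication sends the whole dataset; later runs send only the delta
  zfs send tank/vms@auto-20160101.0300 | ssh offsite zfs recv -F backup/vms
  zfs send -i @auto-20160101.0300 tank/vms@auto-20160102.0300 | ssh offsite zfs recv backup/vms
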
On the FreeNAS box I selected the latest snapshot of the volume I keep my VMs on and cloned it to create a new ZFS volume. To my surprise, the clone operation was instant: clearly no data was being copied, since ZFS clones are copy-on-write and just reference the snapshot’s existing blocks. Fucking excellent. With the new volume defined, I added an NFS share pointing to it. On one of my VM hosts, I then attached the new NFS share so I could access the snapshot’s contents, and registered the reverse proxy VM from the snapshot (all while the original kept serving traffic, which it could do so long as I didn’t touch its nginx service). The snapshot-stored rev proxy VM booted, complaining that it had crashed (as you’d expect from a point-in-time snapshot of a running machine). Once it was up, I copied the missing files from it directly to the still-live original VM. Then I trashed the entire snapshot volume and DONE. Recovered.

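For anyone who’d rather do that from a shell than the FreeNAS GUI, the moving parts look roughly like this. The dataset names, snapshot name and mount point are illustrative, and FreeNAS actually handles the NFS share through its own sharing UI rather than the sharenfs property.

  # find the latest automatic snapshot and clone it into a new dataset; the clone is
  # instant because it's copy-on-write and just references the snapshot's blocks
  zfs list -t snapshot
  zfs clone tank/vms@auto-20160101.0300 tank/vms-recovery

  # expose the clone over NFS (plain ZFS; FreeNAS would do this via its sharing UI)
  zfs set sharenfs=on tank/vms-recovery

  # on the VM host, mount the clone and register the recovered VM from it,
  # leaving the original VM untouched and still serving traffic
  mount -t nfs freenas:/mnt/tank/vms-recovery /mnt/vms-recovery

  # once the missing files have been copied off the booted clone, bin the lot
  zfs destroy tank/vms-recovery
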
I’m quite amazed. Thank you ZFS, FreeNAS, FreeBSD! You saved me from myself and I am grateful.

The end.

PS – I know storing SSL keys in git is insecure from a purist perspective. Let’s just say I’ve considered that and feel sufficiently covered.

Epilogue

There’s a little more to this story. Before doing much of anything, I took a VMware snapshot, including memory, of the live nginx VM. Through this, I felt I had a way back to a remotely functioning run-state, even though it wouldn’t solve the actual problem. My plan was to revert to that as often as I needed to while I rebuilt the stuff I’d lost. As it happened, once I had rsynced in the files I’d nuked, straight from my ZFS snapshot clone of the VM, I restarted nginx and the live VM never skipped a beat. RAD!
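
The copy-back itself was just an rsync from the booted clone, plus a config check before touching the live nginx. Hostnames and paths here are illustrative, not my actual layout.

  # pull the nuked vhost configs and keys back from the recovered clone VM
  rsync -av recovered-proxy:/etc/nginx/sites-available/ /etc/nginx/sites-available/
  rsync -av recovered-proxy:/etc/nginx/ssl/ /etc/nginx/ssl/

  # sanity-check the config, then restart the still-live instance
  nginx -t && service nginx restart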