A tale about a Legacy Application, Infrastructure as Code automation, and ideas on reducing the MTTR

Jun 2020

Seasoned systems engineers may remember the time when a cloud was something everyone ignored on a network diagram. It was a time when applications ran on electrified metal and we all knew where our machines lived. In a closet at the end of the corridor. We stopped ignoring the clouds when they brought us data centers on demand along with our bosses’ excitement for saving money.

Over the years, written-off bare metal went to the scrapyard and cloud costs skyrocketed, the latter much to our bosses’ surprise. If there’s any form of survival artist in information technology, it is the good old legacy application. The legacy application is a crucial piece of software, written with the best intentions, in a forgotten time, by our long-retired colleagues. Needless to say, the legacy application is of utmost importance to business continuity.

When the legacy application, as one of the very last, moved from the closet at the end of the corridor to the cloud, we did what humans usually do when applying new technology: we used it the way we used its predecessors, ignoring its inherently new opportunities. The bare metal became a virtual machine and that’s all that changed. The recovery process in the cloud today isn’t much different from the bare metal recovery process of the past.

The legacy recovery process

Create the Virtual Machine: Based on a scientifically unproven rule of thumb, we create a virtual machine with a set of parameters that reflect our willingness to spend our bosses’ money and our mood of the day.
Install the Operating System: It’s legacy, so we can safely use our battle-proven favorite Debian version. Any distribution that is supported is good enough. Some of us have to run legacy applications on Windows, though.
Update the Operating System: Just the usual, everything from typo fixes to patches for catastrophic remote code executions goes here. This might take a while if we chose earlier to spend less of our bosses’ money and skipped the oversized virtual machine.
Install the Application: From the last remaining floppy disk we install the sacred legacy application, re-introducing a myriad of critical bugs along the way. But no worries, we’ll deprecate the whole application soon, at least that’s what our boss has been telling us for the last ten years.
Configure the Application: This could be anything from copying a config file that worked in the past to mystical, special-case tweaking of parameters based on tribal knowledge. The goal is to make the application work, nothing more. Often enough, that is challenging enough.
Make state-persisting storage available: Whatever the underlying storage technology might be, we have to make it available to the application. This could mean configuring a database host and the related authentication. In other cases, mounting a persistent disk that is not the root disk is sufficient.
Restore the state from a backup: This might take a while and the progress bar is guaranteed to be stuck at 99% for an uncomfortably long time. Stay cool!
Start Application: The magic moment! Will it work?
Handle Requests: Here we are, for the first time in the process providing value to the business. Cool, the boss likes it when we add value to the business.
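
The steps above can be sketched as an ordered runbook that times each step, which at least gives us a measured number for how long a full recovery takes. This is only a sketch: the function name is made up and the step actions are empty placeholders standing in for whatever scripts or manual work each step really involves.

```python
import time
from typing import Callable

def run_recovery(steps: list[tuple[str, Callable[[], None]]]) -> list[tuple[str, float]]:
    """Run the steps in order and return (name, seconds) for each one.

    An exception raised by a step aborts the run, because every later
    step depends on the earlier ones having succeeded.
    """
    timings = []
    for name, action in steps:
        started = time.monotonic()
        action()
        timings.append((name, time.monotonic() - started))
    return timings

# Placeholder actions (assumption): replace each lambda with the real work.
steps = [
    ("Create the Virtual Machine", lambda: None),
    ("Install the Operating System", lambda: None),
    ("Update the Operating System", lambda: None),
    ("Install the Application", lambda: None),
    ("Configure the Application", lambda: None),
    ("Make state-persisting storage available", lambda: None),
    ("Restore the state from a backup", lambda: None),
    ("Start Application", lambda: None),
]

timings = run_recovery(steps)
total = sum(seconds for _, seconds in timings)
print(f"recovery took {total:.1f}s across {len(timings)} steps")
```

Handling requests is deliberately not a step here: it is the point at which the clock stops.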

Once a legacy application lives in the cloud, a systems engineer may be motivated to use Infrastructure as Code automation, such as Terraform. Automation is good at ensuring that the actual state of cloud infrastructure precisely matches the codified intent. So good that automation sometimes causes outages by aggressively modifying the infrastructure layer of applications. A classic example is a memory upgrade to a virtual machine that, given the right mix of configuration and cloud vendor plugin, can result in a destroy-and-recreate sequence.

You heard that right: In those scenarios, automation engages in a sequence of first destroying a virtual machine and then recreating it with the new set of desired properties. Users of cloud-native, containerized, stateless, orchestrated, all-the-fancy-things workloads barely notice such infrastructure changes. These workloads are usually designed to run on top of ever-changing infrastructure (which runs on top of ever-changing infrastructure which runs on top of…). Users of our time-honored legacy application, however, find themselves wandering the forest of dangling requests when the one virtual machine hosting it suddenly disappears. The Mean Time To Recovery (MTTR) is often on the order of hours here, starting with installing the operating system again and walking all the way to handling requests.
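
One way to avoid being surprised by such a sequence is to look at the plan before applying it. The sketch below assumes the plan was exported with `terraform show -json` and relies on the `resource_changes` entries of that JSON, where a replacement shows up as both `delete` and `create` in `change.actions`. The function name and the trimmed sample plan are made up for illustration.

```python
import json

def find_replacements(plan_json: str) -> list[str]:
    """Return the addresses of resources the plan would destroy and recreate."""
    plan = json.loads(plan_json)
    replaced = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        # A replacement lists both actions, in either order.
        if "delete" in actions and "create" in actions:
            replaced.append(resource["address"])
    return replaced

# Hypothetical, heavily trimmed plan (assumption): a real one has many more fields.
sample_plan = json.dumps({
    "resource_changes": [
        {"address": "google_compute_instance.legacy",
         "change": {"actions": ["delete", "create"]}},
        {"address": "google_compute_disk.data",
         "change": {"actions": ["no-op"]}},
    ]
})

print(find_replacements(sample_plan))
```

Wired into a pipeline, a non-empty result that includes the legacy machine could block the apply until a human has had a look.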

If you are used to the different but similarly breathtaking challenges of a cloud-native, well-orchestrated workload landscape, you are probably horrified by the many steps it takes to recover the legacy application. In legacy land, there are no disposable pods, no containerized applications, and more often than not, state is an ill-defined term. We can, however, apply some of the wonders of cloud computing to reduce the overall MTTR.

Saving state for a speedy recovery
Performing recovery actions ahead of time
Have a tested replacement system ready at all times

We’ll look into each.

Saving state for a speedy recovery

Snapshot the system state: We are not working with spinning rust (actual disks in a physical server) anymore. What the system sees as a disk is a series of bytes living in the cloud. Whatever the underlying technology may be (Elastic Block Store, Persistent Disk), we can snapshot it. This isn’t a backup, but it is a starting point. Should automation kill our virtual machine, we can recreate it using the system disk snapshot. This reduces the MTTR from hours to minutes with some (minor?) loss of state. Of course, this only works if the filesystem can deal with interruptions. Journaling file systems are pretty good at that.
Snapshot the persistent storage: This might help in a recovery scenario but it is also a good idea in general. For file systems, the aforementioned requirements of handling interruptions apply. For databases, we’re relying on proper transactions being used.
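
How minor the loss of state is depends on how old the newest usable snapshot was when the machine disappeared. The sketch below picks that snapshot and computes the gap. The `Snapshot` record, the function names, and all timestamps are assumptions for illustration; in practice this data would come from the cloud vendor's snapshot listing.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    created: datetime
    completed: bool

def latest_usable(snapshots: list[Snapshot], failure_time: datetime) -> Optional[Snapshot]:
    """Pick the newest completed snapshot taken before the failure."""
    candidates = [s for s in snapshots if s.completed and s.created <= failure_time]
    return max(candidates, key=lambda s: s.created, default=None)

def state_loss(snapshot: Snapshot, failure_time: datetime) -> timedelta:
    """Everything written between the snapshot and the failure is gone."""
    return failure_time - snapshot.created

# Made-up timeline (assumption).
failure = datetime(2020, 6, 1, 12, 0, tzinfo=timezone.utc)
snapshots = [
    Snapshot("snap-a", failure - timedelta(hours=6), True),
    Snapshot("snap-b", failure - timedelta(minutes=15), True),
    Snapshot("snap-c", failure - timedelta(minutes=2), False),  # still in progress
]

best = latest_usable(snapshots, failure)
if best is not None:
    print(best.snapshot_id, state_loss(best, failure))
```

The snapshot interval is therefore the knob that trades storage cost against how much state we are prepared to lose.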

Performing recovery actions ahead of time

Prepare a base operating system image: We can fire up a virtual machine and install the operating system along with all its updates and system-level tweaks and quirks. If we automate this step by using installer scripts that most operating systems ship nowadays we can bake ready-to-use images at regular intervals, e.g. weekly or even daily. We can even go one step further and run the image in a virtual machine similar to the one hosting our legacy application. This will tell us if the base image works as expected in the cloud environment.
Prepare an application image: Take the most recent base image and install and configure the legacy application on top of it. Similar to the previous idea, we run the full thing either manually or, if applicable, via configuration automation systems (think Puppet, Ansible, and the like). This will show us if the legacy application still works with the latest operating system updates and libraries long before we update our precious production machine. This not only reduces MTTR but may also prevent an outage before it happens.
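
The point of baking and testing ahead of time is that a broken image never becomes the one we recover from. A minimal sketch of such a gate follows, assuming images are identified by name and each check is a function that takes an image name and returns True or False. The image names and the checks themselves are hypothetical stand-ins.

```python
from typing import Callable

Check = tuple[str, Callable[[str], bool]]

def choose_image(candidate: str, known_good: str, checks: list[Check]) -> tuple[str, list[str]]:
    """Return the image to use for recovery and the names of any failed checks.

    The candidate is only chosen if every check passes; otherwise we
    stay on the last image that is known to work.
    """
    failed = [name for name, check in checks if not check(candidate)]
    if failed:
        return known_good, failed
    return candidate, []

# Stand-in checks (assumption): real ones would boot a virtual machine
# from the image and start the legacy application on it.
checks = [
    ("boots", lambda image: True),
    ("legacy application starts", lambda image: image != "base-2020-06-08"),
]

print(choose_image("base-2020-06-08", "base-2020-06-01", checks))
```

In this made-up run the newest image fails the second check, so recovery would keep using the older one while somebody finds out which update broke the application.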

Have a tested replacement system ready at all times

The best way of reducing MTTR and having confidence in the recovery process is a combination of all the above ideas: Regularly spinning up a virtual machine, using the latest base image, installing the latest version of the legacy application (joking, it has never been updated in the first place, it’s legacy!), and configuring it to use a copy of the latest persistent data snapshot from production. If all of this starts up just fine we can gain bonus points by throwing copies of production traffic at it or running a set of artificial requests to verify it works properly.
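
The set of artificial requests can be as simple as a handful of paths with the status codes we expect back. The sketch below uses Python's standard `urllib`; the host name, the paths, and the expected codes are assumptions, and the usage example swaps in a canned fetcher so it runs without a network.

```python
import urllib.error
import urllib.request
from typing import Callable, Optional

def http_status(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status of a GET request, or None if the host is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses arrive as exceptions that carry the status code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None

def smoke_test(
    base_url: str,
    expectations: dict[str, int],
    fetch: Callable[[str], Optional[int]] = http_status,
) -> dict[str, bool]:
    """Request every path and report whether it answered with the expected status."""
    results = {}
    for path, expected in expectations.items():
        results[path] = fetch(base_url.rstrip("/") + path) == expected
    return results

# Canned responses (assumption) standing in for the replacement system.
canned = {
    "http://replacement.example/health": 200,
    "http://replacement.example/report": 500,
}
results = smoke_test(
    "http://replacement.example",
    {"/health": 200, "/report": 200},
    fetch=canned.get,
)
print(results)
```

Only a replacement system for which every entry comes back True deserves the label "tested".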

And next time automation kills our production legacy application, we can confidently run our well-tested scripts to recreate the latest snapshotted state from what was production until a few minutes ago. Hooray for automation repairing damage done by automation!

Disclaimer

Of course, none of what I outlined here has ever happened to anyone in the world. This is just a product of my imagination. Anyway, the boss said the legacy application is scheduled for turndown next month. No need to invest in risk management there…