A tale about a Legacy Application, Infrastructure as Code automation, and ideas on reducing the MTTR

Seasoned systems engineers may remember the time when a cloud was something everyone ignored on a network diagram. It was a time when applications ran on electrified metal and we all knew where our machines lived. In a closet at the end of the corridor. We stopped ignoring the clouds when they brought us data centers on demand along with our bosses’ excitement for saving money.

Over the years, written-off bare metal went to the scrapyard and cloud costs skyrocketed, the latter much to our bosses’ surprise. If there is any form of survival artist in information technology, it is the good old legacy application. The legacy application is a crucial piece of software, written with the best intentions, in a forgotten time, by our long-retired colleagues. Needless to say, the legacy application is of utmost importance to the continuity of the business.

When the legacy application, as one of the very last, moved from the closet at the end of the corridor to the cloud, we did what humans usually do when applying new technology: we used it the way we used its predecessors, ignoring the new opportunities it offers. The bare metal became a virtual machine and that’s all that changed. The recovery process in the cloud today isn’t much different from the bare metal recovery process of the past.

The legacy recovery process

Once a legacy application lives in the cloud, a systems engineer may be motivated to use Infrastructure as Code automation, such as Terraform. Automation is good at ensuring that the actual state of cloud infrastructure precisely matches the codified intent. So good, in fact, that automation sometimes causes outages by aggressively modifying the infrastructure layer of applications. A classic example is a memory upgrade to a virtual machine that, given the right mix of configuration and cloud vendor plugin, can result in a destroy-and-recreate sequence.

You heard that right: In those scenarios, automation first destroys a virtual machine and then recreates it with the new set of desired properties. Users of cloud-native, containerized, stateless, orchestrated, all-the-fancy-things workloads barely notice such infrastructure changes. These workloads are usually designed to run on top of ever-changing infrastructure (which runs on top of ever-changing infrastructure which runs on top of…). Users of our time-honored legacy application, however, find themselves wandering the forest of dangling requests when the one virtual machine hosting it suddenly disappears. The Mean Time To Recovery (MTTR) is often on the order of hours here, starting with installing the operating system again and walking all the way to handling requests.
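One way to notice a destroy-and-recreate before it hits production is to inspect the plan before applying it. The following is a minimal sketch, assuming the plan has been exported to JSON with `terraform plan -out=tfplan` followed by `terraform show -json tfplan > plan.json`. The function names and the file name are made up for illustration; the only thing taken from Terraform is the `resource_changes` list, in which a replacement shows up as a change whose actions contain both `delete` and `create`.

```python
import json


def find_replacements(plan):
    """Return the addresses of resources the plan would destroy and recreate.

    `plan` is the parsed JSON document produced by `terraform show -json`.
    A replacement is listed with both "delete" and "create" in its actions,
    in either order.
    """
    flagged = []
    for resource_change in plan.get("resource_changes", []):
        actions = resource_change.get("change", {}).get("actions", [])
        if "delete" in actions and "create" in actions:
            flagged.append(resource_change.get("address", "<unknown>"))
    return flagged


def load_plan(path):
    """Read a plan JSON file from disk (hypothetical helper)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

A pipeline step could call `find_replacements(load_plan("plan.json"))` and refuse to apply automatically when the address of the legacy application's virtual machine appears in the result. This does not prevent the replacement; it only makes sure a human sees it coming.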

If you are used to the different but similarly breathtaking challenges of a cloud-native, well-orchestrated workload landscape, you are probably horrified by the many steps it takes to recover the legacy application. In legacy land, there are no disposable pods, no containerized applications, and more often than not, state is an ill-defined term. We can, however, apply some of the wonders of cloud computing to reduce the overall MTTR.

Three ideas stand out, and we’ll look into each.

Saving state for a speedy recovery

Performing recovery actions ahead of time

Have a tested replacement system ready at all times

The best way of reducing MTTR and having confidence in the recovery process is a combination of all the above ideas: regularly spinning up a virtual machine, using the latest base image, installing the latest version of the legacy application (joking, it has never been updated in the first place, it’s legacy!), and configuring it to use a copy of the latest persistent data snapshot from production. If all of this starts up just fine, we can gain bonus points by throwing copies of production traffic at it or running a set of artificial requests to verify it works properly.
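The drill described above can be sketched as a small runner that executes the recovery steps in order, stops at the first failure, and records how long everything took. This is only an illustration under assumptions: the step names and the callables behind them are hypothetical placeholders for whatever scripts actually create the virtual machine, restore the snapshot, and send test requests.

```python
import time


def run_drill(steps, clock=time.monotonic):
    """Run recovery steps in order and time them.

    `steps` is a list of (name, action) pairs, where each action is a
    callable returning True on success. The drill stops at the first step
    that fails or raises. Returns (passed, total_seconds, results), where
    results holds one (name, ok, seconds) tuple per executed step.
    """
    results = []
    started = clock()
    for name, action in steps:
        step_started = clock()
        try:
            ok = bool(action())
        except Exception:
            # A crashing step counts as a failed step, not a crashed drill.
            ok = False
        results.append((name, ok, clock() - step_started))
        if not ok:
            break
    total_seconds = clock() - started
    passed = len(results) == len(steps) and all(ok for _, ok, _ in results)
    return passed, total_seconds, results
```

Run on a schedule with steps such as "create VM from latest base image", "attach copy of latest snapshot", and "send artificial requests", the total duration of a passing drill gives a measured estimate of the recovery time rather than a hoped-for one.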

And next time automation kills our production legacy application, we can confidently run our well-tested scripts to recreate the latest snapshotted state from what was production until a few minutes ago. Hooray for automation repairing damage done by automation!

Disclaimer

Of course, none of what I outlined here has ever happened to anyone in the world. This is just a product of my imagination. Anyway, the boss said the legacy application is scheduled for turndown next month. No need to invest in risk management there…