Why you need to destroy production every day

I often see organizations struggle with continuous delivery because they apply the practice to only one layer of the technology stack.

They have an automated build and release system (CI/CD) that deploys a workload (a container or virtual machine) to an environment several times a week, or even several times a day, but they never redeploy the environment itself on a regular basis.

I was inspired by Eleanor Saitta, a principal consultant and public speaker. In her opinion, workloads should be refreshed multiple times a day. I will put a link to one of her talks at the end of this post.

The benefits of an ephemeral environment

Sure, more and more environments are provisioned with Terraform or a similar infrastructure-as-code tool, but the fact is that you rarely see a production environment (or even a staging or acceptance environment) redeployed from scratch more than once per decade.

Usually, the environment is set up with Terraform once and then never destroyed. Over time this can make the codebase unstable, because applying small deltas does not always behave the same as applying from scratch.

It also happens that people with (too much) access manually change something in a mostly static environment, and from that point on the Terraform state, the codebase and the environment start to drift apart.
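One way to catch this kind of drift early is to run a read-only plan on a schedule and look at its exit status. The sketch below is a minimal illustration, not a finished tool. It assumes the `terraform` CLI is on the PATH and that the working directory has already been initialized; the function names and status labels are hypothetical. It relies on the `-detailed-exitcode` flag of `terraform plan`, which returns 0 when nothing would change, 1 on error, and 2 when the plan contains changes.

```python
import subprocess

# Exit codes returned by `terraform plan -detailed-exitcode`.
# The status labels on the right are our own naming, not Terraform's.
PLAN_STATUS = {0: "in_sync", 1: "error", 2: "drift"}


def classify_plan_exit_code(code: int) -> str:
    """Map a plan exit code to a status label; unknown codes count as errors."""
    return PLAN_STATUS.get(code, "error")


def check_drift(workdir: str) -> str:
    """Run a read-only plan in workdir and report whether changes are pending."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit_code(result.returncode)
```

Note that a "drift" result only says the code and the environment disagree. It could mean someone changed the environment by hand, or simply that a code change has not been applied yet.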

Aiming for a full overnight redeploy of the production environment, preferably with zero downtime, is the best defense against the aforementioned problems, and against many other things that can harm the availability, stability and integrity of your production environment and, ultimately, your business.

  • You will have a really fast and well-tested backup/recovery system. Whatever crashes your production environment, just redeploy like you do every day.

  • You will have a strong defense against hacking. Whatever part of the environment is affected by malware, just kill off or quarantine the affected components and redeploy the last healthy snapshot in minutes.

  • You will save money. Striving to deploy your full production environment on demand will uncover deficiencies in your process that can often be eliminated; solving these problems will almost always result in less overhead, less maintenance, and lower overall cost.

Get the infrastructure team on your side

Recreating a production environment overnight feels really counterintuitive. When you introduce the practice in an organization that does not have much experience with automation, you will face a lot of fears and misconceptions. A few examples:

  • The production environment is static, there’s no need to recreate it.
  • It takes too long to recreate the environment overnight.
  • Some things just can’t be automated.

They are all false, but of course there are degrees of automation. Not everything has to be automated at once!

The production environment is static, there’s no need to recreate it

While it is probably true that there is no direct need to recreate the production environment, it is better to do it before the need arises (because by the time it does, you are probably already executing your Disaster Recovery Plan).

Companies that have suffered a data breach sometimes experience a relapse because their adversaries were able to keep backdoors in their systems, sometimes for years.

If your organization can recreate its production environment(s) on demand, you can prevent a lot of serious problems from happening.

It takes too long to recreate the environment overnight

If you try to recreate the environment overnight by hand, it will probably take too long. It is also unrealistic to expect that you can redeploy your current production environment by the end of next week.

Before you recreate the production environment for real, you will have to declare every aspect of that environment as code and test it first, for instance in a separate cloud tenant.
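Such a rehearsal could be scripted roughly as follows. This is a sketch under assumptions, not a recommended pipeline. It assumes the `terraform` CLI is installed and that `workdir` holds a configuration pointed at a separate test tenant, never at production. The names `rehearse`, `run_command` and `smoke_test` are hypothetical, and the smoke test itself is whatever health check makes sense for your workload.

```python
import subprocess
from typing import Callable, List

Runner = Callable[[List[str]], int]


def run_command(cmd: List[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(cmd).returncode


def rehearse(workdir: str, smoke_test: Callable[[], bool],
             run: Runner = run_command) -> bool:
    """Build the environment from scratch, check it, then tear it down again."""
    tf = ["terraform", f"-chdir={workdir}"]
    if run(tf + ["init", "-input=false"]) != 0:
        return False
    try:
        applied = run(tf + ["apply", "-auto-approve", "-input=false"]) == 0
        # Only run the health check when the apply itself succeeded.
        healthy = applied and smoke_test()
    finally:
        # Tear down even after a failed apply so nothing half-built lingers.
        destroyed = run(tf + ["destroy", "-auto-approve", "-input=false"]) == 0
    return bool(healthy and destroyed)
```

The command runner is passed in as a parameter so the sequence can be exercised without touching any real infrastructure. Running this nightly against the test tenant gives you an early signal when "apply from scratch" stops working.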

Some steps just can’t be automated

When this argument comes up, it is most often used by people who have no idea where to start. They would rather keep things as they are (who wants to improve things when they’re already good, anyway?) and avoid the mental exercise of imagining how to automate something that supposedly can’t be automated.

This is understandable, because some manual processes seem (and are) very complex on the surface. However, when you are brave and take some time to sit down and go over the process step by step, you will notice that there are unexpected automation opportunities that will make your life easier and your business grow faster.

Conclusion

You need some courage to destroy your production environment. But by taking the time to learn how, you gain many benefits in terms of reliability and security. Come talk to us if you have a project in need of automation.

Eleanor’s talk on the State of Cyber Security
