How to Automate Drupal Deployments with Rollback

March 12, 2020
How to Automate Drupal Deployments with Rollback Part One

You might know how to deploy Drupal, but do you know how to automate Drupal deployments? Even more, do you know how to automate a rollback of a Drupal deployment? In this article, we’ll cover both topics using scripts I’ve used and proven to work well for most use cases.

Why Should You Automate, Anyway?

I encourage you to take some time to read through Code Climate’s blog post, “State of DevOps 2019 Tells Engineering Organizations to ‘Excel or Die.'” The article discusses the DORA report and arrives at an interesting statement:

“Transitioning to some form of Continuous Delivery practices shouldn’t be a question of if or even when. Rather, it should be a question of where to start.”

Automating your deployment and rollback steps is a great place to start. Doing so can bring stability and consistency to your release management process, and it’s a critical step on the way to Continuous Delivery.

Goals and Assumptions

Now, before we dig in I have a few items to cover.

First, I should outline some assumptions:

You’re using Drupal 8 or higher.

You’re using Drush 9 or higher to export and import configuration, which is how you deploy changes to your website’s configuration entities. You modify configuration on a lower environment, such as your local development environment, and export those changes to code. That code is pushed to the canonical environment, where you import the changes.

You’re deploying a build artifact. In other words, you have a build step in your delivery pipeline that creates the changeset you are deploying.

You have a canonical environment, which means a single source of truth for state that is not deployed with the code repository, such as the database and media. Usually, this is the environment called production.

You have a pre-production environment, which is an environment that closely matches production. The names Staging or Test are also commonly used to refer to this environment.

Second, we should outline some goals for deployment automation:

  1. Minimize the likelihood of bugs or downtime.
  2. A deployment should be marked as failed if the step that failed cannot be resolved through automation.

Last, we should outline some goals for rollback automation:

  1. The state of the system at a failed deployment should be auditable at a future time, even after rollback is complete.
  2. The rollback has a restore point to restore to and is optimized for restoring the entire system to that state.

Now, let’s look at the scripts. It’s important to keep in mind as you read through these scripts that they are intended to achieve the goals outlined above. There are cases where these scripts are unfit, so let’s consider these scripts a starting point from which you can make adjustments for those special cases.

Deploy to Production

Let’s first look at the deployment script for production and then we’ll discuss it line-by-line. As you read the script, keep in mind that if any step fails, the deployment process should halt unless otherwise specified. Here is the script:

drush sset system.maintenance_mode TRUE
# Create a restore point by taking backups of anything that is not in the code repository: database, media, cache
# Checkout the code you are deploying
drush updb -y
drush cim sync -y || drush cim sync -y
drush cim sync -y
drush sset system.maintenance_mode FALSE
drush cr

Line 1: We enable maintenance mode to ensure our restore point is consistent. If our restore point is not consistent we may run into bugs relating to mismatched state, e.g. an entity reference that doesn’t exist.

Line 2: Creating a restore point is highly coupled with your hosting provider. This step is expensive in terms of time, but it is critical to a successful rollback. At minimum, you’ll need to include the database. You may also consider including media (public and private files), any additional application state like caches, or your search index.
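As a rough illustration of what a restore point could look like on a self-managed host, the sketch below dumps the database with drush sql:dump and archives the public files directory. The function name create_restore_point, the backup directory layout, and the FILES_DIR default are assumptions made for this example; a managed hosting provider will usually offer its own backup tooling that replaces this step.

```shell
# Hypothetical helper: create a restore point for state that lives outside the code repository.
# Usage: create_restore_point <absolute_backup_root> <label>
create_restore_point() {
  local backup_root="$1" label="$2"
  local target="${backup_root}/${label}"
  # Assumed location of public files; adjust for your site.
  local files_dir="${FILES_DIR:-web/sites/default/files}"

  mkdir -p "$target" || return 1

  # Dump the database. Use an absolute path: Drush may resolve relative
  # paths against the Drupal root. With --gzip the file gets a .gz suffix.
  drush sql:dump --gzip --result-file="${target}/database.sql" || return 1

  # Archive media if the directory exists.
  if [ -d "$files_dir" ]; then
    tar -czf "${target}/files.tar.gz" -C "$files_dir" . || return 1
  fi

  # Print the restore point location so later steps can reference it.
  echo "$target"
}
```

A call such as create_restore_point /var/backups/example "$(date +%Y%m%d-%H%M%S)" would give each deployment its own labeled restore point. Private files, caches, and the search index are left out here and would need their own steps.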

Line 3: Checking out the code you are deploying is also highly coupled with your hosting provider. You will want to make sure you are checking out a deployment artifact that has a unique pointer (e.g. a commit hash or tag) in your Version Control System (VCS). This will be useful during rollback unless you can do a complete restore of the filesystem containing the code (e.g. if using containers).
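If the artifact lives in Git, one way to sketch this step is to record the commit that is currently deployed and then check out the new ref in detached-HEAD mode. The function name checkout_release and the state file are assumptions for illustration; hosts that deploy by swapping directories or container images will do this differently.

```shell
# Hypothetical helper: check out an exact ref and remember what was deployed before.
# Usage: checkout_release <tag_or_commit> <state_file>
checkout_release() {
  local ref="$1" state_file="$2"

  # Record the currently deployed commit so a rollback knows where to return.
  git rev-parse HEAD > "$state_file" || return 1

  # Fetch tags, then check out the exact ref without moving any branch.
  git fetch --tags --quiet || return 1
  git checkout --quiet --detach "$ref" || return 1
}
```

Storing the previous commit hash alongside the restore point keeps the code pointer and the application state for the same moment in one place.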

Line 4: Drupal’s update process should be the very first thing to run once the new code is in place. Do not clear any caches before running updates! This may seem counter-intuitive, but it is required for certain types of updates, such as converting field definitions (Drupal Commerce has one such example). Remember, the point of update hooks is to make changes to application state or configurations like container services, entity types, field definitions, block plugins, etc. If the application doesn’t have information about the previous state of things, it will try to load the new field definition from code, which may not match the database schema (among a myriad of other possible issues), and throw an error.

If you observe that the update process is failing, double-check whether a full cache rebuild is being run by an update hook or whether a hook is trying to read and use things like field definitions. See this commit on Acquia Lightning for some commentary on what can happen in practice.

Directly related, take a look at Issue #3100553 on Drupal Commerce which explains the challenges that some updates might run into. Thank you Matt Glaman and Centarro Support for helping clarify this issue!

One last point here: did you notice we aren’t using drush entup? That’s because automatic entity updates are no longer supported as of Drupal 8.7.0, and for good reason. As discussed above, it is the responsibility of update hooks to ensure the proper state and configuration changes are applied. You can read more about this on Issue #2976035.

Line 5: Immediately after running the update process, you’ll want to import your configuration changes. You’ll notice the line has two pipe characters. There is an issue with the config import process: if a module is being installed and configuration entities are being created for that module in the same import, the import may fail to create those entities because of missing dependencies. This is a cache issue, and we don’t want to fail the entire deployment for that reason alone. To work around it, we run config import again if the first attempt fails, which is what the double pipe does: the command on the right runs only when the command on the left exits with a non-zero status. See the bash reference for details.
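The retry behavior of the double pipe can be made explicit with a small wrapper. This is only a sketch; run_with_one_retry is a hypothetical name, not a Drush or bash built-in.

```shell
# Hypothetical helper that mirrors `cmd || cmd`: run a command and retry it
# exactly once if the first attempt exits with a non-zero status.
run_with_one_retry() {
  if "$@"; then
    return 0
  fi
  echo "First attempt failed; retrying once: $*" >&2
  # The exit status of the retry becomes the exit status of the function.
  "$@"
}
```

Calling run_with_one_retry drush cim sync -y behaves like Line 5 of the script, with the addition of a log message that records when the retry was needed.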

Line 6: Assuming the first config import completes successfully, we need to run the config import a second time. The first config import may include changes to the behavior of config import itself, such as settings for the Config Ignore or Config Split modules. We must run config import a second time to ensure those changes take effect.

Line 7: At this point, we are done with the critical parts of deployment. Now, we disable maintenance mode to allow regular use of the site again.

Line 8: Last, we rebuild caches to remove anything that was cached while maintenance mode was enabled. For example, if page cache is enabled, it may have cached the maintenance page, which we don’t want to serve now that the site is no longer under maintenance.

That’s it! We now have a universal starting point for a deployment script, with the caveat that lines 2 and 3 will vary per hosting provider and may not be a single line.
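To make the “halt on any failure” rule concrete, the steps above could be wrapped in a function along these lines. This is a sketch under stated assumptions: host_create_restore_point and host_checkout_code are hypothetical placeholders for the two hosting-specific steps, and they deliberately fail until you replace them. The -y on drush updb is added here so the update step does not wait for confirmation in an unattended run.

```shell
# Hypothetical placeholders for the hosting-specific steps; replace both.
host_create_restore_point() { echo "TODO: implement backups for your host" >&2; return 1; }
host_checkout_code() { echo "TODO: implement code checkout for your host" >&2; return 1; }

# Usage: deploy <tag_or_commit>
deploy() {
  local ref="$1"

  drush sset system.maintenance_mode TRUE || return 1
  host_create_restore_point || return 1
  host_checkout_code "$ref" || return 1
  drush updb -y || return 1
  # Retry once to work around the missing-dependency cache issue.
  { drush cim sync -y || drush cim sync -y; } || return 1
  # The second import picks up changes to config import behavior itself.
  drush cim sync -y || return 1
  drush sset system.maintenance_mode FALSE || return 1
  drush cr
}
```

Note that a failed step leaves the site in maintenance mode on purpose: the deployment has halted, and the next decision (fix forward or roll back) should be made before traffic returns.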

Deployment to Pre-Production

When we deploy to a lower environment like a pre-production environment (sometimes called a staging environment), we need to adjust our deployment script slightly. As with the production deployment script, a failure at any step should halt the deployment unless otherwise specified. Here is the script:

drush sset system.maintenance_mode TRUE
# Create a restore point by taking backups of anything that is not in the code repository: database, media, cache
# Copy application state from the production environment to the target environment
# Checkout the code you are deploying
drush updb -y
drush cim sync -y || drush cim sync -y
drush cim sync -y
drush sset system.maintenance_mode FALSE
drush cr

You can see that the only difference is we have a new line added at Line 3. Recall that our canonical environment, production, has application state that is not tracked in code. Therefore, to ensure we are performing a deployment to pre-production in the exact same way we would deploy to production, we must copy all application state to the pre-production environment. At minimum, this should include the database. We need the database for the obvious reasons that it includes content and unmanaged configurations not tracked in code. Also having media (public and private files), caches, and the search index can allow us to rehearse the exact way these changes will deploy to production.
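Where Drush site aliases are available, the copy step might be sketched with drush sql:sync and drush rsync as shown below. The alias @example.prod and the function name copy_state_from_production are assumptions for illustration, and many hosting providers offer their own environment-copy tooling that would replace this.

```shell
# Hypothetical helper: copy application state from production into the current site.
# Usage: copy_state_from_production [source_alias]
copy_state_from_production() {
  # Assumed alias name; define the real one in your Drush site alias files.
  local source_alias="${1:-@example.prod}"

  # Replace the local database with a copy of production's database.
  drush sql:sync "$source_alias" @self -y || return 1

  # Copy public files; %files is Drush's path alias for the public files directory.
  drush rsync "${source_alias}:%files" @self:%files -y || return 1
}
```

Private files, caches, and the search index are not covered by this sketch and would need additional steps if you want the rehearsal to match production closely.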

Deployment Rollback on Production

The most challenging part of rolling back a deployment is perfectly matching the point in time just prior to the deployment. Since our deployment step captures all application state, we have that point to roll back to! Here is what a scripted rollback looks like. Again, if any step fails, we halt the rollback unless specified otherwise:

drush sset system.maintenance_mode TRUE
# Create backups of anything not in the code repository for auditing later: database, media, cache (we’ll assume logs are sent to an aggregator)
# Checkout the code you are deploying (i.e. the code from the last successful deploy)
# Restore anything that is not in the code repository from the restore point: database, media, cache
drush sset system.maintenance_mode FALSE
drush cr

Line 1: Put the site into maintenance mode so that we can take backups for auditing later. If this step fails, don’t fail the rollback as it may be symptomatic of the very reason you are doing a rollback.

Line 2: Run the backups to capture all the application state for future auditing. By having the complete application state, we can potentially recreate the issue that required performing a rollback.

Line 3: Check out the code from the last successful deployment. If your backup process includes the code and filesystem together, then perhaps you can group this in with Line 4.

Line 4: Restore all application state. At minimum, this is the database. But, you may desire to include media (public and private files), caches, and the search index.

Line 5: Disable the maintenance mode to allow regular use of the site again.

Line 6: As with a deployment, we rebuild caches to remove anything that was cached while maintenance mode was enabled.
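The rollback steps can be sketched the same way as a deployment, with one difference: a failure to enable maintenance mode is logged but does not stop the rollback. The three host_* functions are hypothetical placeholders for the hosting-specific steps and deliberately fail until you replace them.

```shell
# Hypothetical placeholders for the hosting-specific steps; replace all three.
host_backup_for_audit() { echo "TODO: implement audit backups for your host" >&2; return 1; }
host_checkout_code() { echo "TODO: implement code checkout for your host" >&2; return 1; }
host_restore_from_restore_point() { echo "TODO: implement restore for your host" >&2; return 1; }

# Usage: rollback <last_good_tag_or_commit>
rollback() {
  local last_good_ref="$1"

  # Best effort only: a failure here may be a symptom of the broken deployment.
  drush sset system.maintenance_mode TRUE \
    || echo "Warning: could not enable maintenance mode; continuing" >&2

  host_backup_for_audit || return 1
  host_checkout_code "$last_good_ref" || return 1
  host_restore_from_restore_point || return 1
  drush sset system.maintenance_mode FALSE || return 1
  drush cr
}
```

Because the restore replaces the database, the maintenance mode flag stored in state is restored along with it, so disabling it explicitly afterwards keeps the result predictable regardless of when the restore point was taken.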

Deployment Rollback on Pre-Production

You’ll want to be able to test your rollback process, and being able to do that on pre-production has many benefits. The steps here look exactly the same as on production. Since our pre-production deployment process includes a step to back up all application state, we have a restore point we can use to do a rollback.

A Closing Thought

We mentioned these scripts should serve well as a starting point from which you can make adjustments according to the goals and needs of your project team. Every software project and every client has unique needs and constraints that challenge what we know as a “good” deployment or rollback process. As you go through the process of implementing automated deployment and rollback scripts, reflect on whether the process and tools being put into place are making reasonable use of our universally precious resource: time.