Recipes for Automated Drupal Deployment and Rollback

March 12, 2020

Previously we shared scripts for automating Drupal deployment and rollback. Those scripts serve as a starting point, but every project has unique constraints and challenges that merit customization, and there are myriad problems you can run into in practice. Inevitably, every project will be deployed or rolled back in a unique way, especially for enterprise applications.

Since automated deployment and rollback are typically set up once and changed only slightly over time, you may never get a chance to see how others have solved similar problems. Wouldn’t it be encouraging if the solutions you’ve come up with were the same ones others are using? Or, what if there was a new idea that could improve your process? You’ve come to the right place.

Having set up automated deployments for a variety of Drupal projects, I’ve noticed a few patterns that are worth sharing. This article is written in a "recipe" fashion, so you are encouraged to skip around to parts relevant for you.

Using Drush

A number of deployment tutorials talk about running deployment steps from the browser. Don’t do this! That isn’t automated. And, the environment configuration for browser requests differs from the command line; a common example is a browser request hitting an execution timeout or memory limit that the same step run from the command line wouldn’t hit.

You should be running these steps in a scriptable way using tools like Drush, Drupal Console, or Acquia BLT. If your hosting provider offers an API, that works too.

You Don’t Have a Build Artifact

Another thing a number of deployment tutorials talk about is running composer install as part of the deployment steps. If you are doing this, you may be setting yourself up for a disaster.

Here’s an example of why: remember in 2016 when a huge number of build processes started failing because a library was removed from the npm registry? Have you ever had a build fail with a connection timeout when composer was trying to download a patch from drupal.org? You do NOT want your deployment success dependent on the stability of the Internet or community libraries!

The wisdom that Michelle Krejci shared with me long ago comes into play: "Production is an artifact of development." (You can watch her talk from DrupalCon 2016). At minimum, you should be running composer install and committing those dependencies to the repository as part of a build process. This creates an "artifact" that you can then deploy.

Personally, I recommend using Acquia BLT, and there is a quick video you can watch to see what the tool does in action. BLT is compatible with Acquia and any hosting provider that uses git for managing code.

Pre-Production Deployments Are Failing When Scrub Runs

Please note the two different database scrub processes that may run on Acquia. They serve as a good reminder that you need to know what processes run on your hosting provider, so that after you deploy code changes the update process completes before anything else runs.

Acquia Cloud Site Factory (ACSF)

You’ll need to be careful of the 000-acquia_required_scrub.php hook in post-db-copy. You’ll notice that the scrub process rebuilds the router (see line 159) which will likely run before your update process on pre-production deployments.

A suggestion is to implement a post-db-copy hook named with the prefix "000-aaa-" to ensure it runs before the scrub. Keep in mind this hook does not run on production deploys, so you’ll need to implement the update process differently for production deploy.

Acquia Cloud

Be careful of the db-scrub.sh hook in post-db-copy. You may only have this hook if you’re using BLT. But, if you do have it, you’ll notice that it runs a full cache rebuild (see line 161), which of course causes the problems we’ve discussed on pre-production deployments. You can use the same suggestion for addressing the ACSF scrub hook here as well.

Copying Caches From Production

When you are deploying to pre-production, you want to copy all application state, including caches, to the pre-production environment. If you aren’t storing caches in the database and are instead using a storage like memcache, you may not be able to copy caches on deploy or it simply may not be practical.

If you can’t copy the caches, then you have an awkward application state when the update process runs: the caches are representative of the pre-production state before the deployment, which is a complete unknown, but the database is from production. Shouldn’t that cause issues during the update process since caches haven’t been rebuilt yet?

I haven’t heard of any issues proven to be caused by this, but if you suspect this is causing issues for you, there is something you can consider: put important caches in the database. Then, when the database is copied the caches get copied as well. In concept, if you’re using memcache the code would look something like this:

$settings['cache']['bins']['render'] = 'cache.backend.memcache';
$settings['cache']['bins']['dynamic_page_cache'] = 'cache.backend.memcache';
$settings['cache']['default'] = 'cache.backend.database';

The idea would be to put all caches in the database and then choose which ones are OK to leave in memcache or other cache storage. Be warned, I haven’t proven this works in practice!

Improving Speed of Deployments

There are a myriad of reasons why you want your deployment to be as fast as possible; minimizing downtime on production is a key reason. There are a couple patterns I’ve seen used:

Don’t Take a Backup During the Deployment, Instead Trigger it Separately

This might work well if you have a highly-cached site with few or no changes to the application state and can inform anyone with access to the Drupal admin to not be in the system during the deployment window. Be cautious of trying to do this with commerce sites, sites that accept user-generated content, or sites with integrations that result in frequent changes to Drupal’s state.

Don’t Clear All Caches, Instead Selectively Clear What You Need

This is a good fit for large multi-sites, commerce sites, and sites with lots of traffic. To achieve this, you’ll need to be aware of any processes that might be clearing cache. Notably, you’ll want to use the --no-cache-clear option with updb to avoid it running drupal_flush_all_caches().

You should still clear some caches on deploy like entity definitions, but caches like dynamic page cache and HTTP proxy caches (e.g. Varnish) can now be cleared more cautiously.
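As a rough sketch of this pattern, assuming Drush 9 or later, the update step and the selective clears might look like the following. The function name deploy_selective is hypothetical, and the list of caches is only an assumption about what a typical release needs; I'm also assuming that clearing plugin definitions covers entity type definitions.

```shell
#!/bin/sh
# Sketch: run database updates without the automatic full cache flush,
# then clear only the caches this release is assumed to need.
# deploy_selective is a hypothetical name; adjust the cache list per site.
deploy_selective() {
  # --no-cache-clear stops updb from calling drupal_flush_all_caches().
  drush updb --no-cache-clear -y || return 1
  # Plugin definitions (assumed here to cover entity type definitions).
  drush cache:clear plugin || return 1
  # Routes may have changed with the new code.
  drush cache:clear router || return 1
  # Render cache, dynamic page cache, and Varnish are deliberately left
  # alone so they can be cleared later and more cautiously.
}
```

Each step returns early on failure so the calling script can decide whether to halt or roll back.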

Improving Speed of Restore

Likewise, the speed of restoring your site in the event of an outage or degraded service due to a deployment can be critical to the business. A few ideas I’ve seen include:

Don’t Take a Backup Before Restore

This is a risky path to take. The amount of coordination and effort it takes to plan a release can be significant. You will want all the tools and data you can get to successfully diagnose the cause of a failed deployment so that you can resolve it for the next. If you don’t take a backup before you restore, you may be missing the critical information you need for that diagnosis. This approach can work, but use it cautiously.

Don’t Restore the Media

This is sometimes low risk. Drupal does a good job of keeping the original files for media entities intact and you can easily regenerate image styles. However, be cautious with sites that have a large amount of media: you may be putting yourself in a worse spot if you have to regenerate lots of image styles due to a configuration change. Also be aware that if changes were made to media after your restore point was taken, you may end up with data that is inconsistent with the actual files on the filesystem.

Always Restore the Database

Always, always restore the database during a rollback. Skipping this step will cause you more harm than good. The only exception that comes to mind would be if you are rolling back a very minor code or config change that you know is idempotent. A rule to consider is this: you can roll back a deployment without restoring the database only if its changeset, counted in lines of code, is equal to or less than the number of years you’ve been managing Drupal deployments. If you’ve been doing it for two years, then you are allowed to evaluate whether to roll back a two-line code change without a database restore.

How To Identify Restore Point on Rollback

Performing a rollback assumes you have a way to uniquely identify the restore point associated with the last good deployment. You can’t simply restore to the latest restore point because the rollback script might also be creating backups.

If your backup process can attach a custom label to backups (as Acquia Cloud Site Factory can), you could build the label from two pieces: whether the backup was generated as part of a deployment or a rollback, and the tag that was checked out when the backup was taken. Then, a backup labeled "deployment-5.3.8" is the one taken while 5.3.8 was still checked out, at the start of the 5.3.9 deployment, so it is the restore point for rolling back 5.3.9. And, if you see a backup labeled "rollback-5.3.9" then you know it was taken during the rollback from 5.3.9 to 5.3.8, which can be used for auditing. In the case of multiple backups with the same label, you would take the most recent.

Assuming you don’t have the ability to label your backups you would need to identify something else as a unique identifier and pass that as a parameter to your rollback script. This could be an actual identifier number or a timestamp that includes seconds.
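To make the label lookup concrete, here is a minimal sketch of choosing the most recent backup for a given label. The input format (timestamp, label, backup ID per line) and the helper name latest_backup_id are assumptions for illustration; your hosting provider's backup listing will look different and would need to be reshaped into something like this first.

```shell
#!/bin/sh
# Sketch: pick the restore point for a rollback from a list of backups.
# Input format is an assumption: one backup per line on stdin as
#   <unix-timestamp> <label> <backup-id>
# latest_backup_id is a hypothetical helper, not a hosting-provider API.
latest_backup_id() {
  wanted_label=$1
  # Keep only rows with the wanted label, sort newest first by timestamp,
  # and print the ID of the first row. Prints nothing when no row matches.
  awk -v label="$wanted_label" '$2 == label { print $1, $3 }' |
    sort -k1,1nr |
    head -n 1 |
    awk '{ print $2 }'
}
```

A rollback script could pipe its backup listing into latest_backup_id "deployment-5.3.8" and refuse to continue when the result is empty, rather than falling back to whatever backup happens to be newest.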

Update Your Data, But Only After Config Changes Are Applied

Needing to update data after config changes have been applied is a common use case. For example, maybe you create a new field on a node and you want to move data from the old field to the new one. There is a Drupal ticket to create a hook_post_config_import_NAME which would allow you to achieve this very goal! What can you do in the meantime?

In that ticket, a comment explains that you can use hook_post_update_NAME and modify how you run the update process to only run it after config import. You would change your deployment steps to use the --no-post-updates flag when running updb and then run it an additional time after cim like so:

drush updb --no-post-updates -y
drush cim sync -y || drush cim sync -y
drush cim sync -y
drush updb -y

The Massachusetts government website (mass.gov) has included this in its deployment steps. See their Deploy Commands class.

Alternate Workaround: Change Configs in the Update Hook

If you can’t use the hook_post_update_NAME workaround, there is an alternative that is well documented in the notice Support for automatic entity updates has been removed (entup). There are several code examples on how to modify configurations, but they can be pretty cumbersome, even duplicative if the configurations you need are already part of the configuration synchronization process.

There isn’t a very easy way to just import a configuration you need. That’s where the Config Importer and Tools module comes in handy. With this module you can now import configurations in just a couple lines:

$config_importer = \Drupal::service('config_import.importer');
$config_dir = \Drupal\Core\Site\Settings::get('config_sync_directory');
$config_importer->setDirectory($config_dir);
$config_importer->importConfigs(['mymodule.settings']);

The key parts are to set $config_dir to the directory containing the exported configuration files you want to import, and pass the list of configurations you want to import to importConfigs. The project page shows other examples, including how to import configurations from a module directory.

Need to Reindex Search

If your Search API configurations are managed in code, you’ll want to ensure that on deployment your search index is updated. The search index process has potential to take a lot longer than you would tolerate during a deployment window. But, there’s an easy compromise.

Add this line at the end of your deployment script:

drush search-api:reset-tracker

This will mark the indexes you have to be reindexed, but it doesn’t delete the data. So, your search should still be functional while your cron takes over to do the reindex. Be mindful of cases where not clearing the index will cause issues (e.g. template files expecting a new field that isn’t available yet in the index).

Ideally, you would be able to write code changes to accommodate stale data in the search index, but sometimes you can’t avoid it. In that case, you might be better served by extending the deployment window to include enough time to completely reindex the search content.

Also, be mindful of how long it takes to reindex your content. If it takes hours or days to reindex, you may want to increase the frequency of that cron task, increase the batch size for each cron run, or just plan for the reindex to be part of your deployment window.
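A quick back-of-the-envelope estimate can help you decide between those options. The sketch below assumes you know three numbers for your own site: how many items need indexing, how many items a single cron run indexes, and how many minutes pass between cron runs. The helper name reindex_minutes is made up for this example.

```shell
#!/bin/sh
# Sketch: estimate how long cron needs to finish a full reindex.
# All three inputs are assumptions you would measure on your own site:
#   $1 items to index, $2 items indexed per cron run, $3 minutes between runs.
reindex_minutes() {
  items=$1
  batch=$2
  interval=$3
  # Number of cron runs, rounded up so a partial final batch still counts.
  runs=$(( (items + batch - 1) / batch ))
  echo $(( runs * interval ))
}
```

With illustrative figures of 120,000 items, 50 items per run, and a run every 15 minutes, reindex_minutes 120000 50 15 gives 36,000 minutes, which is 25 days. Raising the batch to 500 and running every 5 minutes brings that down to 1,200 minutes, or 20 hours.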

Need to Preserve State in Pre-Production

Imagine you have a module that uses the OAuth protocol. You do not want the production and pre-production environments to use the same refresh tokens because one of the environments will use that token first and the other will receive authentication errors. Some people simply re-authenticate the environment manually after a deployment to pre-production, but is there an automated way?

Yes! In your pre-production deployment you can add something like the following as the first line:

DRUPAL_REFRESH_TOKEN=$(drush sget mymodule.refresh_token)

Then, after you run config import but before you disable maintenance mode, you can write that value back:

drush sset "mymodule.refresh_token" "$DRUPAL_REFRESH_TOKEN"

When implementing this idea, you’ll want to expect a few failed deployments until you get the exact syntax right. Capturing output like this using Bash can cause some tricky issues. Also, be cautious of these commands failing. If either command fails, do you really want the deployment to fail?

Probably not, and here’s why. Imagine a scenario where you run a deployment and the deployment fails on the last step. You realize you have a small code change to make, which you do, and deploy again but it fails on the first step. In effect, you can’t deploy new code to pre-production because it can’t get past the first step! So, you may consider rewriting those lines to guarantee a zero exit code even if it fails.
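One way to get that guarantee is to wrap each command so a failure is logged but never propagated. This is only a sketch: the function names save_token and restore_token are hypothetical, and the state key is the placeholder used earlier.

```shell
#!/bin/sh
# Sketch: preserve a state value across a pre-production deployment without
# letting either step fail the deployment. The state key and function names
# are hypothetical; "drush sget" and "drush sset" are the commands used above.
STATE_KEY="mymodule.refresh_token"

save_token() {
  # On failure, fall back to an empty value and still return success.
  DRUPAL_REFRESH_TOKEN=$(drush sget "$STATE_KEY" 2>/dev/null) || DRUPAL_REFRESH_TOKEN=""
  return 0
}

restore_token() {
  # Nothing was captured, so there is nothing to write back.
  [ -n "$DRUPAL_REFRESH_TOKEN" ] || return 0
  drush sset "$STATE_KEY" "$DRUPAL_REFRESH_TOKEN" ||
    echo "warning: could not restore $STATE_KEY" >&2
  return 0
}
```

Call save_token as the first step and restore_token after config import. The trade-off is that a lost token now shows up as a warning in the log instead of a failed deployment, so someone still has to notice it and re-authenticate.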

Pre-Deployment Checks

We won’t cover every check you might have in your release pipeline here. But, there are a few checks worth considering right before you trigger a deployment to production.

Overridden Configurations

If you’re not using the Config Read-Only module, then there will be that one time that a configuration gets overridden on production and a deployment wipes out that change. The problem of ensuring configurations on production are never overridden can be difficult when governance involves many people with varying needs and opinions about access. Also, you’re bound to run into that rogue module that is storing transient state in config entities instead of using State API or Cache.

An easy solution to keep good hygiene is to add this line as the first step in your deployment script:

drush config-status 2>&1 | grep -q 'No differences'

Now, the deployment will halt at the first line if there are differences. You can then decide whether to preserve those changes, either by manually re-applying them after deployment or by exporting them to code and rescheduling a deployment that includes them.

Creating Tags in the Right Branch

Have you ever accidentally created a tag from the wrong branch? It can happen to anyone and the impact is significant: you’re deploying the wrong code! There are a lot of ways to prevent that, but one way would be to confirm the tag you’re deploying is on the intended branch programmatically like so:

git branch master --contains "tags/$BITBUCKET_TAG" | grep -q master

In this example, "BITBUCKET_TAG" would be an environment variable with the name of the tag you’re deploying. Doing this check as part of your deployment may not be possible or it may be too late in the deployment pipeline. Typically, this check would go in your build step that gets triggered on tag creation.
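If matching on the branch name with grep feels too loose (it would also match a branch named something like "master-hotfix"), a different technique is to ask git directly whether the tagged commit is an ancestor of the branch. This is a sketch of that alternative, not the check shown above; tag_is_on_branch is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: succeed only when the tagged commit is reachable from the branch.
# tag_is_on_branch is a hypothetical helper; run it inside the repository.
# $1 is the tag name, $2 is any ref for the branch (for example "master").
tag_is_on_branch() {
  tag=$1
  branch=$2
  # --is-ancestor exits 0 when the first commit is an ancestor of the second
  # (or the same commit), and non-zero otherwise.
  git merge-base --is-ancestor "refs/tags/$tag" "$branch"
}
```

In a build step this could be called as tag_is_on_branch "$BITBUCKET_TAG" master. In a CI clone that only has remote-tracking branches, you would likely pass something like origin/master instead.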

Tag Doesn’t Exist

If you have a process that builds your deploy artifact and then triggers an automated deployment, you may want to make sure the deployment doesn’t run unless the expected tag is available on your production environment. For example, it would really stink to trigger the deployment and have the deployment process put the site into maintenance mode, take backups, and only then report that the deployment has failed because the tag doesn’t exist. Let’s avoid that.

Writing the check will be dependent on your hosting environment. If you’re using Acquia Cloud, there is an endpoint at "/applications/{applicationUuid}/code" that returns a list of references. If you’re doing a regular git checkout, you can use rev-parse like so:

git rev-parse "$BITBUCKET_TAG" > /dev/null 2>&1

Again, we assume "BITBUCKET_TAG" is a bash variable defined in the script’s runtime. This check is best to include as part of your deployment script.
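If the script that triggers the deployment doesn't have a checkout of the repository it deploys from, you can ask the remote instead. The following is a sketch under that assumption; remote_has_tag is a hypothetical helper, and the remote can be a configured name such as "origin" or a repository URL.

```shell
#!/bin/sh
# Sketch: check that a tag exists in the repository you deploy from before
# starting any destructive step. remote_has_tag is a hypothetical helper;
# $1 is a remote name, URL, or path and $2 is the tag name.
remote_has_tag() {
  remote=$1
  tag=$2
  # --exit-code makes ls-remote exit with status 2 when no ref matches.
  git ls-remote --exit-code --tags "$remote" "refs/tags/$tag" >/dev/null 2>&1
}
```

Placed before the maintenance-mode and backup steps, remote_has_tag origin "$BITBUCKET_TAG" lets the script stop while the site is still untouched.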

Someone Committed a Git Submodule

I have to throw this one in here in case anyone is not able to exclude composer dependency directories (like vendor, docroot/modules/contrib, docroot/themes/contrib, etc.) from the repository you’re working out of for development. Have you or someone on your team cloned a module or repository instead of using composer and then committed that module? In effect, it creates a git submodule which can cause surprising and hard to diagnose issues.

Well, luckily there’s an easy check you can use:

git diff --quiet

This line runs a git diff to check whether there are any changes; if there are, it returns a non-zero exit code. If the commits you’re pulling (or cloning) have a submodule the diff will show that as a change which then causes the build or deployment to halt (this is assuming you aren’t running "git submodule update" beforehand).

This check is best to put in as early in your deployment pipeline as you can so that it is caught early. If you must include it as part of a deployment, you can put it immediately after the step that deploys code to the environment.
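If you would rather look for the nested repositories directly, a different approach is to inspect the index for entries with mode 160000, which is how git records a nested repository. This is a sketch of that alternative rather than the diff check above, and has_gitlinks is a hypothetical helper name. Note that it will also flag submodules you added on purpose.

```shell
#!/bin/sh
# Sketch: list paths recorded in the index as gitlinks (mode 160000), which
# is how an accidentally committed nested repository is stored.
# has_gitlinks is a hypothetical helper; run it inside the repository.
has_gitlinks() {
  # Prints each offending path and exits 0 when at least one is found.
  # Paths containing spaces are truncated at the first space.
  git ls-files --stage |
    awk '$1 == "160000" { print $4; found = 1 } END { exit !found }'
}
```

A build step could then print the offending paths and stop, for example by running has_gitlinks and exiting non-zero whenever it succeeds.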

Review the Deployment Changes

Apply Linus’ Law wherever you can, which boils down to: the more people who look at code, the more obvious issues become. It’s perfectly reasonable and encouraged to review your entire release before you trigger a deployment. If you are the one responsible when a deployment fails, you will want an overview of everything that has changed so that if and when there’s an issue, you already have a few ideas on where to look.

A Word of Caution

If you are responsible for the success of your deployment and rollback process, you have a difficult balance to keep. If you continually make changes to your deployment and rollback process, you risk introducing failures. Even a single deployment or rollback failure can shatter the trust and confidence in the technology and the people supporting it. But, if you never make changes and have no plan for handling changes to the deployment and rollback process, you’re planning to fail.

Test your deployment and rollback process with every change. This is a perfect use case for having a pre-production environment where you can use the exact same deployment and rollback scripts. Depending on the constraints of your hosting provider, you could consider other deployment models like blue-green and rolling deployments. Regardless of deployment model, ensure you are getting early feedback on any failures after a deployment so that you can fail forward with a quick fix or roll back.