Limiting Downtime for Schema Changes (Dear SQL DBA Episode 25)

You need to release schema changes while the SQL Server is in use. Learn why code generation tools write odd scripts and how to stay sane amid rapid releases in this 28 minute video… or scroll down to read a summary.

If you enjoy the podcast, please write a review on iTunes!

Dear SQL DBA…

How can I achieve near zero downtime deployment with database schema changes? Can I use Peer-to-Peer Replication or some other feature to make this easier? It seems very complicated.

We use SQL Server Data Tools (SSDT) and the developers work agile and quite often make small changes to the schema. And we have found there are some snags with it, since SSDT generates the change script.

Here is an example of a common problem: when the developers add a column to a table you often get a Create/Insert/Drop Table create script by SSDT. Avoiding table locking when possible would help achieve near zero downtime deployment.

Peer to Peer Replication makes schema changes harder to manage, not easier

not-sure-if-uptime-or-downtimeI totally get why it seemed like P2P might help with this: it’s the only technology in SQL Server where you can have multiple “write” nodes for the same database.

Microsoft’s documentation on Peer to Peer replication says

Publications must allow schema changes to be replicated. (This is a setting of 1 for the publication property replicate_ddl, which is the default setting.)

I’m not sure if that’s a hard requirement (and it just won’t work if you disable replicating schema changes), or if your use of P2P is unsupported if you disable replicating schema changes— but you want to meet the requirement either way.

The requirement simply makes sense when it comes to data consistency. If I drop a column in a table on Node 1 at 2 pm, should the application be allowed to insert rows into the same table on Node 2 at 2:05 pm if I haven’t applied the schema change to Node 2? How would that those rows get replicated back to Node 1?  This would put the column into a Shroedinger’s cat situation: does the column exist, or not?

Other things to think about:

  • Sometimes code changes may break because they aren’t allowed by replication (such as changing a Primary Key)
  • When you have new tables, they have to be either manually added, or you need to manage the TSQL to add them (articles aren’t added automatically)
  • Adding or dropping articles requires quiescing, which is basically downtime for writes
  • Changes of a very large volume of data can make the replication get behind and the system harder to manage
  • You need to make sure that writes have been fully propagated to all nodes before moving the application to read from another node, otherwise writes they may have seen already could disappear
  • You have to manage conflicts if the same row is modified on more than one node
  • Peer to Peer replication is also not compatible with lots of new features at this time too (In-Memory tables, Availability Groups), which makes its future seem uncertain
  • Peer to Peer replication is Enterprise Edition only, so if you don’t already have the licensing for multiple nodes, it’s a very expensive gamble.

Peer to Peer replication wasn’t built to solve this particular problem. You can manage schema changes while using Peer to Peer, but it’s tough stuff.

Enterprise Edition has a cool feature when it comes to adding columns

For the example you mention of adding a column, some automatic code-writing tools do the “create a new table, insert, rename” operation because adding a column in place can sometimes be a very large, painful alter because it’s a “size of data operation.”

When you add a not-nullable column with a default value, SQL Server may have to go through and update every single row on the table to add that default value. That’s a really big in-place operation, and if you cancel it midway, you may be in for a nasty rollback on a large table. That’s why some folks gamble on just loading up a new table and switching it in, instead. In some cases, it’s faster!

I say adding a column may be a size of data operation because you have options:

  1. SQL Server Enterprise Edition has a feature where adding a new column with a default value avoids locking the table with a size of data operation.  That feature was added in SQL Server 2012. Remus Rusanu wrote a great article about it, Online non-NULL with values column add in SQL Server 2012.
    • This doesn’t work for all data types, and it only works for a constant (not something like NEWID where every row has a different value). Read Remus’ post for more details.
  2. You also have the choice of adding your new column as nullable, and writing code to manually work through the table and populate the rows in batches.

Super awkward – you might have that Enterprise feature everywhere except production

If you have Developer Edition in your development and staging environments, but Standard Edition in production, changes adding non-null columns with default values could run very quickly everywhere except production on your large tables. That’s because Developer Edition gets all the features of Enterprise Edition.

Note: I also talked about size of data operations in a previous Dear SQL DBA episode when I discussed another schema change: altering an INT column to a BIGINT.

Avoiding downtime with schema changes is mostly about process and experience

whoopsI’ve worked in a bunch of environments where we rolled schema changes to production throughout the week, and worked hard to limit downtime. Our biggest challenges were:

  • The code released wouldn’t have the intended effect, and multiple hotfixes would sometimes need to be written quickly if the issue was urgent.
  • Problems would happen with replication in production – either replication would be delayed because a lot of modifications were occurring, or a change in an “upstream” system would cause a huge amount of data processing to happen in a “downstream” database

But while there were bumps along the road, we got pretty good at it! It’s not easy, but the more you do it, the better you get at it.

Push schema changes before application changes

Adding a new feature? Tables and columns get added to the database well in advance of anything using them. This makes it easier if the schema changes need to be rolled back. Or if something goes wrong and the schema change doesn’t succeed, that whole release doesn’t have to be rolled back.

Test in a full scale pre-production environment

You need a staging environment to roll changes to before they hit production. Key points:

  • You need a full dataset. If production is a 2TB database, you need a 2TB database here.
  • This can’t be a dev environment. Dev environments are chaos, and have a bunch of pre-release code in them. This needs to match production.
  • The environment needs to be configured like production as well– if production is a cluster with replication, this environment needs to have a cluster with replication as well.

As long as you use this environment for testing and validating production changes, you can license this with Developer Edition — but note that “super awkard” limitation above where adding a column could be fast everywhere except production. And you still need to invest in storage and hardware. (It’s possible to go cheap on the hardware, but if you do so then you won’t get a reliable estimate of how long it’s going to take to deploy changes to production.)

If you’re just starting down the “deploy all the time and minimize downtime” road and you don’t have this, that’s OK.

When people start asking, “How can we make sure these mistakes don’t happen next time?”, this is your #1 suggestion.

Some code needs to be custom

For smaller changes, it may be OK to use automatically generated code to roll the change– but larger tables and sensitive databases can require custom coding to make sure you can do things like take advantage of that feature above, or custom code it.

The DBA usually finds out when there’s a need for custom code when they’re pushing a change to the pre-production (aka staging) environment, and it takes longer than expected.

It needs to be OK to screw up and have downtime

This was totally counter-intuitive to me. The whole point is that you’re trying to limit downtime.

But this is hard stuff. Things are going to break. Everyone needs to be able to learn when that happens, and being afraid doesn’t help people communicate better.

You need to do postmortems when you have downtime, and you need to have postmortems that don’t suck!

The hardest part of making this work is preventing the human tendency to blame from ruining the whole thing.

But we really can’t have downtime because of X, Y, Z

If it’s really not OK for schema changes to cause downtime, because people will be harmed, too much money will be lost, or some other reason, then either:

1. Releasing schema changes to SQL Server while production is active isn’t right for that application.

2. You need a highly customized architecture for each application designed to make this work and avoid downtime.

  • For some applications, you may be able cache the reads in the application tier, or a read only copy of the database, and queue the writes while you’re doing the schema changes.
  • Some applications may be able to be sharded into smaller databases, so data sizes are kept small. Schema changes on small tables are much simpler than large tables.

There may be other options as well, but it’s going to be a custom solution for each application.

, , , , , ,

2 Responses to Limiting Downtime for Schema Changes (Dear SQL DBA Episode 25)

  1. Michael J Swart December 15, 2016 at 10:55 am #

    I always thought that waterfall was named after the look of a gantt chart. It’s tied to non-agile projects because there’s typically no iterative piece to planning when using gantt charts.

    We’ve worked very hard at this problem of zero-downtime migration changes and I’m glad you’re talking about it.

    The key for us was to decouple schema changes from code changes (just like your advice “Push schema changes before application changes”).
    For every change, we
    * deploy “pre” sql scripts (enabling backwards compatibility for both code versions)
    * make the code changes (using newly deployed web servers)
    * deploy “post” sql scripts (cleaning up)

    We used to have an “offline” category of sql scripts that can only run offline. But in three years, we’ve never needed to use that category.

    (Of course we’re very fortunate to be able to use Enterprise edition)

  2. John Songy December 15, 2016 at 12:15 pm #

    Excellent treaties on limiting downtime for schema changes. Thank you for covering so many scenarios, and going over some of the possible solutions.

    I have worked for a company doing agile development with weekly updates on an environment where no downtime is acceptable for financial reasons. We used a combination of approaches, most of which you covered in your presentation. We did use peer-to-peer replication. I do not recommend peer-to-peer replication for the reasons you mention (and more); however, it worked for our needs.

Leave a Reply