Required Testing for Installing SQL Server Cumulative Updates and Service Packs


Microsoft recently updated its policies and recommendations for installing cumulative updates. The header for each cumulative update now reads:

Microsoft recommends ongoing, proactive installation of CUs as they become available:

  • SQL Server CUs are certified to the same levels as Service Packs, and should be installed at the same level of confidence.
  • Historical data shows that a significant number of support cases involve an issue that has already been addressed in a released CU.
  • CUs may contain added value over and above hotfixes. This includes supportability, manageability, and reliability updates.

But Important Things Can Go Wrong in Cumulative Updates and Service Packs

The header of each cumulative update says something else, too:

  • Just as for SQL Server service packs, we recommend that you test CUs before you deploy them to production environments.

That’s because nobody’s perfect. Something may break in your environment after you install a Service Pack or cumulative update. Microsoft tests the releases, but you need to test them, too.

You Should Still Be Installing Cumulative Updates

Some folks are in the habit of waiting for a while after a Service Pack is released to see if issues arise or if the Service Pack gets patched. That isn’t a great plan either. You end up way behind on the release cycle. Whenever you have a performance issue, it’s extra work to figure out if your problem is fixed in the releases you haven’t installed yet.

Don’t be cynical about Service Packs and Cumulative Updates. Be better at testing.

So what should you test?

I’ve got a list!

1. Test Query Performance. You Need a Benchmark.

Install the cumulative update into a non-production environment (prod test or staging) and run application activity against it.

Ideally, you want this to be representative activity for your real, production applications, run from multiple application servers. You also want the hardware and storage to be as similar as possible.

If that’s not possible for you to set up, you need some sort of tool to generate concurrent activity. There are a lot of options; Michael Swart gives a great rundown here.

Whatever you choose, implement it in a way that mimics your production environment as much as possible. Think about things like these:

  • If production uses in-memory tables, your benchmark should use them
  • If a database on the production instance uses Transparent Data Encryption, a database on your benchmark server should, too (it impacts tempdb for all databases)
  • Your security setup and authentication should mimic what you use in production. Use Kerberos in prod? Use it in staging.
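
However you generate the load, capture numbers you can compare between the old and new build. As a minimal sketch (the table name and run labels below are made up, not any standard), you can snapshot wait statistics into a table before and after each benchmark run, then diff the two patch levels:

-- Snapshot cumulative wait stats so runs on the old and new build
-- can be compared. Run once before and once after each benchmark,
-- changing @run_label each time.
IF OBJECT_ID('dbo.BenchmarkWaitStats') IS NULL
    CREATE TABLE dbo.BenchmarkWaitStats (
        capture_time   DATETIME2     NOT NULL DEFAULT SYSDATETIME(),
        run_label      NVARCHAR(100) NOT NULL,
        wait_type      NVARCHAR(60)  NOT NULL,
        waiting_tasks  BIGINT        NOT NULL,
        wait_time_ms   BIGINT        NOT NULL,
        signal_wait_ms BIGINT        NOT NULL
    );

DECLARE @run_label NVARCHAR(100) = N'SP3-CU7-before';  -- change per run

INSERT dbo.BenchmarkWaitStats
    (run_label, wait_type, waiting_tasks, wait_time_ms, signal_wait_ms)
SELECT @run_label, wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;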

2. Test Backups, While the Benchmark Is Running

This one is super easy to forget, because most people don’t run backups outside of production.

But don’t forget to test full and log backups while activity is running! Yep, Service Packs and cumulative updates can break them, or significantly change their performance. Don’t let this be a nasty surprise you only learn about in production.

Backups in your staging environment may normally be slower than backups in production. That’s OK; just make sure you track what’s normal for the environment, so you can tell if they got significantly slower.
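
If you want a starting point, here’s a hedged sketch of timing those backups in staging. The database name and file paths are placeholders, and the log backup assumes the database is in the FULL (or BULK_LOGGED) recovery model:

-- Time a full and a log backup while the benchmark workload is running.
-- CHECKSUM makes the backup verify page checksums as it reads.
DECLARE @start DATETIME2 = SYSDATETIME();

BACKUP DATABASE StageDB
    TO DISK = N'S:\Backups\StageDB_full.bak'
    WITH COMPRESSION, CHECKSUM, INIT;

PRINT N'Full backup seconds: '
    + CAST(DATEDIFF(SECOND, @start, SYSDATETIME()) AS NVARCHAR(20));

SET @start = SYSDATETIME();

BACKUP LOG StageDB
    TO DISK = N'S:\Backups\StageDB_log.trn'
    WITH COMPRESSION, CHECKSUM, INIT;

PRINT N'Log backup seconds: '
    + CAST(DATEDIFF(SECOND, @start, SYSDATETIME()) AS NVARCHAR(20));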

3. Test Index Maintenance and CHECKDB (Yep, While the Benchmark Is Running)

Similar to backups, you want to put all your maintenance through its paces. Don’t run it all at the same time; run it serially.

Most benchmark workloads can run at different levels. You may choose to normally run your benchmark at a lower level while running index maintenance, and that’s totally fine — a benchmark that hammers the staging server to its limits hopefully doesn’t mimic your production environment. Just be consistent and know what normal times are.
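
As a sketch of what “serially, with timings” might look like (the database and table names are placeholders, and ONLINE index rebuilds are Enterprise Edition only):

-- Run CHECKDB, then index maintenance, one after the other,
-- recording how long each takes at the current patch level.
DECLARE @start DATETIME2 = SYSDATETIME();

DBCC CHECKDB (StageDB) WITH NO_INFOMSGS;
PRINT N'CHECKDB seconds: '
    + CAST(DATEDIFF(SECOND, @start, SYSDATETIME()) AS NVARCHAR(20));

SET @start = SYSDATETIME();

ALTER INDEX ALL ON dbo.BigTable REBUILD WITH (ONLINE = ON);
PRINT N'Rebuild seconds: '
    + CAST(DATEDIFF(SECOND, @start, SYSDATETIME()) AS NVARCHAR(20));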

4. Test Failovers (if You’re Using High Availability)

This is so critical. I learned that back when I applied a Service Pack to a production cluster, and then found it wouldn’t come online when I tried to fail it over. We had a configuration issue that we only found in production because we didn’t have any clusters outside of production.

At least, we didn’t before that incident. We sure had them afterward.

If you’re running Availability Groups with three replicas and a log shipping secondary in production, you need the same topology in pre-production/staging. And you need to plan and test exactly how the upgrade will roll out, when you’ll fail over what to where, and don’t forget to make sure that it fails back, too.
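
For Availability Groups, here’s a hedged sketch of the failover test itself. The AG name is a placeholder, and for a no-data-loss failover this should run on a synchronous-commit secondary that’s fully synchronized:

-- Check replica health before failing over.
SELECT ag.name,
       rs.role_desc,
       rs.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS rs
JOIN sys.availability_groups AS ag
    ON ag.group_id = rs.group_id;

-- On the synchronized secondary that should become primary:
ALTER AVAILABILITY GROUP [StageAG] FAILOVER;

-- Then repeat in the other direction to confirm it fails back.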

5. Test Application Jobs and Scheduled Packages

Make sure that you’ve got the same scheduled SQL Agent jobs and regularly scheduled processes set up against your staging server, and that they all run normally and in good time.

Never assume that staging matches production exactly before you start. Yep, it should, but it never hurts to make sure that the same jobs are set up the same way.
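
One quick way to compare is to list the jobs and schedules on each instance and diff the output. A minimal sketch against msdb:

-- List Agent jobs and their schedules; run on staging and on
-- production, then compare the results.
SELECT j.name    AS job_name,
       j.enabled AS is_enabled,
       s.name    AS schedule_name
FROM msdb.dbo.sysjobs AS j
LEFT JOIN msdb.dbo.sysjobschedules AS js
    ON js.job_id = j.job_id
LEFT JOIN msdb.dbo.sysschedules AS s
    ON s.schedule_id = js.schedule_id
ORDER BY j.name;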

Do You Have Time to Do This Every Cumulative Update?

Cumulative Updates are released every two months. Your job probably includes more than testing Cumulative Updates and Service Packs. You may not have time to do this properly, and if performance and availability are important, you shouldn’t cut corners on this.

In a perfect world, you’d test and evaluate every CU. The more imperfect your world is, the harder that is.

Compromise: Don’t Install Every CU, But Read the Release Notes and Update Regularly

When I first became a DBA, our team had the following policy:

  • Pick a Service Pack + Cumulative Update combo that’s recent, and establish it as stable across production
  • Read the release notes for all hotfixes, Cumulative Updates, and Service Packs, looking for fixes that might be relevant to the environment
  • Update to a new SP + CU level a few times a year, either when a compelling fix is released or when a full quarter has gone by with no change
  • New SP + CUs always burn in on development instances first
  • They then proceed to staging / pre-production
  • In staging and pre-production, DBAs kick the tires, fail the instances over, and look for deal breakers (the testing plan above)
  • They then go to less critical production instances first, and gradually work up to more critical instances
  • Exceptions are available for critical hotfixes / breakfix, and those issues always involve escalation to Microsoft Premier Support
  • Change Management is used all the time

It worked really well then, and this plan still works well with the new Cumulative Update policy.

How to Stay Current on New Releases

Cumulative Updates not only contain important fixes, they also sometimes contain interesting little features, like trace flag 7471. To stay fresh, keep an eye on Microsoft’s release announcements each time a new build ships.
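
A quick way to tell how far behind you are is to check the build an instance is actually running and compare it against the latest released CU for your branch. A minimal sketch:

-- What build is this instance on? Compare the result against the
-- most recent Cumulative Update for your version.
SELECT SERVERPROPERTY('ProductVersion') AS build_number,
       SERVERPROPERTY('ProductLevel')   AS sp_level,
       SERVERPROPERTY('Edition')        AS edition;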


14 Responses to Required Testing for Installing SQL Server Cumulative Updates and Service Packs

  1. Phil Ekins April 29, 2016 at 11:44 am #

    “Most people don’t run backups outside production”? Beside the point of your post, but REALLY?!? Developers get real crabby when you tell them they have lost 2 months of sp changes… I very rarely come across this, and flag it as an issue if I do… Are you seeing it that often?

    • Kendra Little April 29, 2016 at 11:46 am #

      Sure, all the time. Most developers check their changes into a branch of source control, and would cringe at relying on a dev environment database or backup for their code changes.

      Do your developers really not check code into source for months at a time? Do they not know how to branch? (I’m actually a little concerned that this means they aren’t checking in database code.)

  2. Morden Kain May 2, 2016 at 11:16 am #

    As the DBA in charge of updating and backing up our SQL Servers, I can attest that we have backups for our Dev and Test environments for the past 6 months, with the ability to roll back to a specific day for 30 days. Yes, our developers are a bit spoiled in that regard. I also pull prod data back to Dev and Test as well. Or I can do a restore onto Dev or Test of any database from Prod.

    As for testing, something we found out by accident, and are still testing, is that Trace Flag 4199 is breaking one of our DacPac builds. We ended up turning off that TF even though it was turned on to implement a fix for ITF/UDFs with SQL Server 2014 SP1 CU6 (yes, we are currently testing CU6, which came out two weeks ago).

    I agree though with the statement not to install every CU. I track the CUs within a spreadsheet and when there are enough changes/fixes to warrant the update, I set it up for the following month. In the case of CU6, we got it fast tracked as we are currently fixing a lot of ITF/UDFs that were working perfectly until we switched compatibility from SQL Server 2008 to SQL Server 2014.

    • Kendra Little May 2, 2016 at 12:31 pm #

      Hey, ain’t nothing wrong with spoiling developers. I’m in favor of backing up those environments for the sake of RTO — it can be a real nightmare to reset them when things go wrong, and if developers can’t work then the company is losing money. Most folks just don’t want to invest in the storage, so they end up taking far fewer backups and keeping far fewer of those environments. I do think your way is better (as long as people are properly using source control, but that’s really the developers’ responsibility).

      I would still test log backups in at least one environment for systems where you use them in production.

      That’s really interesting about TF 4199! Thanks for sharing the info.

      For the compat mode issue, any reason you stuck with the higher compat mode and didn’t just lower it back down to use the old cardinality estimator? Curious if you had some things speed up from the new CE that made up for the regressions?

      • Morden Kain May 2, 2016 at 1:19 pm #

        Yep, TFS is the reason. We switched all our projects over to SQL Server 2014. Once you make the switch, you cannot go back.

    • Morden Kain May 2, 2016 at 4:24 pm #

      Well, the developers just finished testing out the DacPacs, and found that there are a couple of registry keys that default to a rather low value of 5 seconds for SqlPackage.exe. They bumped it up initially to 1000 seconds, and are currently bringing it back down to see where the break point is for the environment. The TFS builds are now succeeding with Trace Flag 4199 on. Here is where the registry key is located:

      HKCU\Software\Microsoft\VisualStudio\\SQLDB\Database\QueryTimeoutSeconds

      You can find the same information here:

      https://social.msdn.microsoft.com/Forums/sqlserver/en-US/b00f58d4-cb6e-42ba-b48e-594517288555/ssdt-timeout-when-deploying-a-database?forum=ssdt

      • Morden Kain May 6, 2016 at 4:34 pm #

        Update: Well, we turned back off 4199 after finding that the fix we thought was working (see above), was not. However, we did finally find the culprit. SQLPackage.exe was executing a query for spatial indexes (which we have none listed). The query was taking 51m 15s to return 0 rows. My co-worker found an article stating to update stats. We tried that, no dice. What I did find is that there are four tables that the update stats did not touch that belong to the SYS schema. Once I updated those, TFS was no longer throwing errors on builds. It actually decreased the build time as well. I posted the fix for this on StackOverflow under the original posters answer (as his fix did not work for me):

        http://stackoverflow.com/questions/36494523/ssdt-publish-timeout/37081880#37081880

  3. Golden Gal May 2, 2016 at 11:16 am #

    So, Kendra, what if…(and this is not such an unusual situation) the MSP has to deal with customers who will NOT pay for a separate dev/test environment? Suggestions on how to circumvent the “testing a service pack on production” situation?

    • Kendra Little May 2, 2016 at 12:43 pm #

      Some IT service providers just explain the risks to the client clearly, and will do the patches untested to production. If they bill by the hour, it’s not necessarily a bad deal for the service provider in the short term.

      However, I don’t like that approach, because it ends up yielding more breakfix hours and an unhappy client. Even if the risks are explained, clients don’t tend to like unexpected surprises. If the unexpected surprises cause data loss, it can be disastrous.

      I have known some IT service providers to simply require a testing environment for upgrades, and also to require a certain number of work hours for the testing and production deployments. It doesn’t have to be a permanent testing environment; it can be a temporary VM spun up just for the purpose of running the testing. Virtualization has made the discussion of testing upgrades much easier by giving more options.

      Essentially, part of being a service provider is setting standards and advising the client about the right way to do work. My experience has been that when the standards are set too low, there’s too much breakfix and it can damage the client relationship, sometimes ending it. So it’s well worth figuring out what your minimum requirements are for patching support (for your own revenue concerns), and sticking to them.

  4. Wendy February 9, 2017 at 10:33 am #

    Kendra, thank you for the article. I have a few questions if you can kindly answer.

    1) If your AG servers are on SQL 2012 SP1 CU3, can I upgrade them directly to SQL 2012 SP3, skipping SP2, then apply CU7 (the latest CU for SP3)?
    2) If I have a scheduled downtime for these three servers in the AG group that are on manual failover, can I just upgrade the primary server, followed by replica 1, then replica 2 in sequence? You don’t need to shut down all the SQL Servers or take them out of the AG to perform the SP upgrades, correct?
    3) Do you have any link/resource, or are you aware whether AG/replication will work fine with some servers on SP1 and some on SP3 CU7? Say the subscribing server is on SP1, but some publishing servers are on SP3 and some are on SP1?
    4) If I have 10 sets of those 3-node AG servers that replicate to a central server, can those publishing servers be on different SPs? I don’t have the capacity to upgrade all the servers at the same time by myself.

    Thank you so much!!

    • Kendra Little February 9, 2017 at 12:39 pm #

      Hi Wendy,

      Generally speaking, you can “skip” service packs. You can also create “slipstream” installation media to apply a service pack and a cumulative update in one go.

      The biggest risks I see in your question are that it sounds like you don’t have a staging environment, and you’re running a v1 version of Availability Groups.

      Even if you found an online reference about a replication topology with the (very) specific versions you’re describing, how would you know they were using replication in the same configuration and topology you are? The only thing that can give you even minimum reassurance is a staging environment where you can run tests on your topology before you ever push to production.

      If I were you, I would take this patching problem as an opportunity to lobby for that staging environment so that I could answer these questions, especially given that folks testing and blogging about AGs are doing so with SQL Server 2016. (It may take more than one conversation to get that completed, but it’s worth it.)

      Amit Banerjee has blogged recently about upgrading replication topologies, and it includes information about different versions that are supported on publisher / distributor / subscribers and allowed versions. He’s talking about major versions, but the information can help inform your upgrade order even on SPs and CUs: https://blogs.msdn.microsoft.com/sql_server_team/upgrading-a-replication-topology-to-sql-server-2016/

      Hope this helps,
      KL

Trackbacks/Pingbacks

  1. Testing Cumulative Updates – Curated SQL - April 29, 2016

    […] Kendra Little discusses how to test cumulative updates: […]

  2. Install Cumulative Updates – Voice of the DBA - May 1, 2016

    […] good news, though as Kendra Little notes, you still need to test. Bugs will still exist in patches, and really all software, so it’s important that you test […]

  3. Index Usage Stats Bug Fixed in SQL Server 2012 SP2 CU12 / SP3 CU3 - by Kendra Little - May 18, 2016

    […] Friends don’t let friends release Cumulative Updates without testing. […]
