Bleeding In Edgeville - When Upgrades Come Too Fast (Dear SQL DBA)

on May 26, 2016

In this episode of “Dear SQL DBA,” I answer a question about early adoption of SQL Server, discuss why testing in production isn’t necessarily crazy, and recommend how to handle requests to upgrade your SQL Server to use new features.

The question…

Dear SQL DBA,

What are your thoughts on the early adoption of new SQL Server versions? Specifically, if the business is willing to assume the risk of early adoption just to get one new feature that they can probably live without, should DBAs be happy and willing to assume that risk too? Or, is it our responsibility to “just say no” until it has been tested? I would like to hear about any experience you have with this. Thanks.

Bleeding in Edgeville

Some people out there are jealous of this problem

Some DBAs have the opposite problem. They are stuck supporting super old versions of SQL Server. This bores them. It also presents a problem for their resumes.

But for all you DBAs stuck on old versions of SQL Server, early adoption can be hard to live with, too.

  • You run into problems that aren’t documented yet
  • May be rushed to apply a patch for problem A and run into problem B
  • This is usually done so developers can use new bleeding edge features
    • No going back
    • Those features can be more prone to problems
    • Hard to figure out if the problems are your new code or the feature itself

There’s a lot of calls in the middle of the night when you’re using bleeding edge features.

Questions to answer when early adoption of a new SQL Server version comes up…

  1. What’s the tolerance of downtime? (Recovery Time Objective)
  2. What’s the tolerance of data loss? (Recovery Point Objective)
  3. Performance requirements. Is poor performance considered downtime?
  4. How many people can respond to incidents, and what are their skills? Does this match up with the tolerance for downtime and data loss? Is a developer available on call to help determine if issues are a coding problem instead of the SQL Server?

It helps if you have more instances!

Large environments have some instances where downtime and data loss aren’t as big a deal

Instances where teams can run bleeding edge versions:

  • Multiple instances of the same type, perhaps behind a load balancer
  • “Nice to have” instances that applications can run without, or can be diverted to a different instance
    • Logging?

Remember: If you can’t deploy to a test environment first, you’re probably doing it wrong. That implies you don’t have a test environment.

Like Noah’s Ark, you don’t want to have one of something.

Testing in Production … can be awesome

This goes for upgrading SQL Servers code as well as rolling out new database code. DBAs are often afraid of this. But it’s not necessarily as bad as it sounds. It can be great when done well.

Imagine a multidatacenter environment, where you can balance traffic

  • Control percentage of users in a datacenter
  • Control what functionality they are using
  • Can fail out to another datacenter whenever you need to

These environments are expensive

  • Hardware investment
  • Operational investment
  • Coding investment

The investment is worth it to the companies who use them so they can ship new features faster

There are still problems with these systems

  • There’s often a parent / upstream system
  • The databases there often are not duplicated in a loosely coupled day
  • Deploying changes / upgrades to the databases in those systems is still tricky

Revisiting the question: What is the DBA’s Responsibility?

If you’re a Senior DBA, you should follow and build / improve change management practices. (Even if you’re a Junior DBA, this is worth doing.)

Document Service Level agreements for your databases, make sure this includes both:

  • Recovery Point Objective (data loss)
  • Recovery Time Objective (downtime)
  • Performance requirements. Is poor performance considered downtime?

Warn about risks to these agreements for any changes.

When incidents come up, get to root cause and follow through on proposing a way to prevent it from happening again. Keep doing this even if your proposals don’t get taken – they won’t each time. Be persistent and positive. Do not get stuck in the “wheel of blame.”

Design and propose a “path to production” for new SQL Server versions

  • No low risk instances to test on? Propose them.
  • This may require paying for some hardware and licenses. If that cost makes it safer and easier to adopt new technology, it can be absolutely worth it to business owners

If you are overworked and nobody has time to get to root cause, raise this as a risk. You can’t prevent repeat cases of problems if nobody finds cause.

Don’t say “no” – say, “Here’s what we need to make this happen. Can you help me get this into our budget?”