Avoid an anti-pattern and decode the blocking pattern (13 minutes)

Some solutions are risky

Discover why you should avoid one popular online workaround for this blocking problem.

Learn which columns in sp_WhoIsActive give critical information to what’s going on with this ugly pattern.

Transcript

Welcome back. We’re gonna talk through the solution to this problem.

The first thing I want to talk through is an anti-pattern for the solution

This is something that I don’t want you to fall for, that I have fallen for, in the past.

If we take a look at the queries that are running, by running sp_WhoIsActive again.

Wow, everybody has really been blocked for a long time now! The SQL Server – for this guy– look how faithfully it’s been sitting there with its lock wait waiting.

SQL Server is just like, “I will wait for you… until… until I can go.”

But one thing that I might be tempted to do is to look at the query for this guy, the blockee. And I do that again in sp_WhoIsActive just by clicking on the query text.

I might say hey, can I change this query so that if somebody else is holding a lock on the table, it can just say, “Hey, I’m gonna go right past you.”

I could take the text of this query and say an easy way I could do that is by taking the text and adding a NOLOCK hint onto it, like this.

See the NOLOCK hint right here?

And the reason you might do this is you’re gonna find blog posts and forum entries all over the Internet saying, “Hey, you can make your query faster.”

You can get this to sneak by blocking by saying, hey, don’t you take out any locks, and that sounds good at first, right?

In fact, if I run this guy, that blocker is still running, we saw that in sp_WhoIsActive right, like, it’s still there, here’s our blocker. If I add this NOLOCK hint, and run this, wow, it snuck right past.

It didn’t get blocked, after I added that NOLOCK hint this is the exact same thing, that NOLOCK hint is, as saying, I want to set the transaction isolation level for my session to read uncommitted, NOLOCK equals read uncommitted, the “un” is very critical there, because the default isolation level for SQL Server is to read committed.

We’re saying read UNcommitted.

The reason that you don’t wanna do this, or that you almost always don’t wanna do this is…

When you say either NOLOCK, or read uncommitted on your query, what you’re saying to the SQL Server is:

I don’t care if the data that you return to me is correct.

I want you to just do a read, and even if someone else is changing the exact data, that I’m reading, even if they’re like, in the middle of changing it, just show me whatever you got right now.

And right now, that data is being changed, that’s the reason that other query was blocked is that the other query is running an update that is changing a Bill To Customer ID, it’s taking the Bill To Customer ID’s, that are set to one, and it’s changing them to two, but it hasn’t committed yet.

So, when I run my dirty reads — “data is not necessarily, and in fact is incorrect”— query, it’s saying I only see the two’s.

Now, what if this was in a report run by my CEO?

Say she’s looking at important billing data, and what if somebody goes in, and kills the query that’s doing the update, because it never committed, right?

And this is a very likely thing that happen, if it’s just stuck, and it’s in a what’s called doomed transaction. It could roll back.

Then she runs the report again, five minutes later, and she sees completely different data.

She might say: what the heck is going on here? Especially if something just doesn’t look right, right?

So, we don’t – most of the time– I say most of the time, because there are some databases where I’ve had people say actually, it doesn’t matter if the data is wrong, we just want a rough idea– and if some people get wrong information it’s okay.

I have had those cases, but they are very, very rare, and sometimes when I say, really, and ask them questions about it, the answer turns around, and becomes actually no, it does need to be right, most of the time, 99% of the time, when we bother to put our data in a relational database, we want the data to be correct.

Don’t just jump and throw NOLOCK hints on these queries

Because what you’re saying is maybe people can start seeing wrong data, depending on when they run stuff, and they may get real confused, and then they may get real upset.

If we just put NOLOCK hints on these queries or say read uncommitted isolation level, we aren’t thinking at all about solving the cause of the problem.

We’re just putting a dangerous, risky workaround in place that can come back to haunt us.

So, I want you to be really careful of this as a solution, because a lot of folks will just say, oh, don’t worry.

You should worry, if you use NOLOCK.

We’re gonna reset our isolation level now to read committed, so that I’m not left in that dirty reads mode for the entirety of my demo.

Let’s talk more about the cause of what’s going on

Let’s look more closely at a couple of columns in sp_WhoIsActive. Really one in particular for my query here, who’s blocked.

I scroll to the right, there is a status column here in sp_WhoIsActive and note that it says that my blocker is sleeping.

One interesting thing about that query, the update query who’s our blocker: very often, when we look at a situation like this, and we see, hey, the first one isn’t waiting on anything, and the second one is blocked by it, it looks like this update just isn’t done.

It’s still running, and ya know, maybe, it’s just a really long update, maybe it has a lot of work to do.

The way that you can confirm that very quickly is by looking at that status, and it isn’t still working away on that update. It’s taking a nap!

The blocker already finished its update

What happened is that this session came in, and it said, I want to begin a transaction. It ran an update, and nothing else happened, at this point it is sleeping, because it’s finished the update.

SQL Server is waiting for it to come back, and say I’m done.

It can say I’m done by committing the transaction, or it could say I’m, done by rolling back the transaction, but it hasn’t told SQL Server either one.

The SQL Server will not assume, that it wants to do one or the the other. For all the SQL Server knows, maybe it wants to run more update statements, or inserts or deletes, and then commit everything as a whole in that transaction. It must wait and see.

If, of course, whatever application run this, ran this, if it has some sort of network issue, or something makes it disconnect, then it’ll be gone, and the SQL Server, be like oh, well, ya know, I’ve gotta roll you back, because you didn’t tell me, and then you left. But as long as it’s still here, we’ve gotta wait.

In the real world, in real environments, this can actually get fairly complicated, because applications don’t always wanna constantly reconnect to the database server, and they use things like connection pooling, so that they can open a connection, do a bunch of stuff on it for one thing, and then reuse it for other things. Connections may stay open for awhile.

That status column is really useful for telling us a lot about the nature of the blocking

We have a sleeping session, who is holding locks, and blocking this other session.

We know that it’s a lock wait because of the wait type on the other session.

We can use an extra parameter on sp_WhoIsActive to see really interesting detailed information about the lock waits, if we want.

We can say @get_locks = 1 on sp_WhoIsActive

Now, if I scroll on over from my blocker, I can click on this lock’s information, and open up, a big old lock graph, with information for every database it has locks in, about the objects, where it has requested locks, or been granted locks, right?

It’s either waiting or been granted, and I can see that in the WideWorldImporter’s database, it has locks on Customer, Customers.Sales, and then CustomerArchive.Sales.

I can see exactly how many of them, and of what type.

Now, as you can see, this is kind of a lot to wade through.

I don’t look at the lock graph all the time when I’m troubleshooting, because it takes a long time to look through. It can be a little tough.

On our blocked session, session 62, it might be easiest to see where it’s waiting for locks

I’m gonna click on that.

Sometimes if you’re dealing with something real complex, and it’s a lot of code, and you’re like: okay, which part of this is it blocked on? This can be real useful, just to say, I’m gonna scroll through here, and I’m gonna look for where the request status = wait on the blockee.

I can see that it’s the primary key of Sales.Customers. I happen to know from past experience, that that is a clustered primary key, it’s the clustered index on the table.

We know that the, the ID is being updated there, and we are doing this big group by. We need to know the right information for all those ID’s, so, we’re getting blocked there.

We have learned more about what’s causing our problem

Not everything about the solution, but we know that there’s something going on with our query who updates, Sales.Customers, and sets the Bill To Customer Id, who’s changing that.

There’s something about the application or person, who runs that under SW_Oops, that left it in a state, where it has an orphaned transaction.

It did begin tran, and did an update, and then it just went to sleep.

We’re gonna have to do more work, to track down the full root cause.

Temporarily, I’m gonna go ahead, and let’s see, it is session 57– temporarily, I am gonna go ahead, and kill that off.

Now, like I said, anytime we kill something, there’s always a risk that the rollback is gonna be long and painful for us, for our customers, for the group of people, that gathers behind at us our desk.

Because it’s an emergency, and it’s still rolling back, right?

In this case, I was lucky

If I run kill again, against my session, and I say with status only, it’ll tell me, are you still waiting? You’re mostly single throughout rollback. In my case, it’s not an active process ID anymore. It has completed its rollback.

If we run sp_WhoIsActive again, it has cleared up. Now, be very, very cautious, like I said, before you kill something, it’s better if you can figure out, whoever ran the query, and say can we cancel it from the caller. Do you know how much it modified? Whenever I need to kill something, I always wanna make sure that if there is a really long painful rollback, I have authorization for that too.

So, let’s dig in more to finding: what are the causes of orphaned doomed transactions? What are the causes of these sleeping sessions holding locks?

Let’s look at what can cause that, and how to fix it, over in the slides.