The dirtiest secret: NOLOCK returns garbage data (10 minutes)

Transcript

Imagine that we have an application that’s going a little haywire, and we don’t necessarily realize it

This happens sometimes with applications. I’m going to copy this code, open a new session, paste it in, and uncomment it. What’s happening with our application is that it’s trying to make some updates, but something is failing and the updates aren’t committing. It’s hitting an exception, and the updates are rolling back.

I have done this in a cheap and easy way by manually putting in a rollback command, but imagine that the application is hitting some sort of error in its business logic. Maybe it’s a longer transaction than we have here, but sometimes we get these situations where our application tries to do something, hits a validation problem, throws an exception, and the whole thing has to roll back. We’re just simulating that here with some updates. I have a cheap and dirty GO command saying run this 100,000 times, so that this will just be constantly happening in the background.
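
The exact script isn’t the point, but a minimal sketch of that background session, assuming it updates the same TotalNameCount column in ref.FirstName that we’ll be summing in a moment, might look something like this (the UPDATE itself is a placeholder; the parts that matter are the ROLLBACK and the GO 100000 batch repeater):

    /* Hypothetical stand-in for the "haywire" session: update, then roll back.
       GO 100000 tells SSMS to repeat the batch 100,000 times. */
    BEGIN TRAN;

        UPDATE ref.FirstName
        SET TotalNameCount = TotalNameCount + 1;  -- placeholder update

        /* Simulate the business-logic failure: nothing ever commits */
    ROLLBACK;
    GO 100000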

The way that people get into NOLOCK is…

The most common cause, the most common gateway to using NOLOCK, is that people see lock waits.

Imagine that we also have something else in our application that’s constantly tracking the sum of the total count of babies. It takes that TotalNameCount column, the same one that is being updated by our application, and it sums it up across the whole table. Now, ref.FirstName is not a very large table. We’re going to have to scan it, but it’s small.

What this block of code is doing is: it creates a temporary table named #NameCount, and 200 times it sums up that TotalNameCount column and returns the total names. It uses some DATEDIFFs to capture a rough duration in seconds for how long it takes to do this 200 times in a loop.
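
As a rough sketch (approximate names and structure, not the exact demo script), that loop looks something like this:

    /* Sketch of the counting loop described above */
    IF OBJECT_ID('tempdb..#NameCount') IS NOT NULL DROP TABLE #NameCount;
    CREATE TABLE #NameCount (NameCount BIGINT);

    DECLARE @i INT = 1,
            @start DATETIME2 = SYSDATETIME();

    WHILE @i <= 200
    BEGIN
        INSERT #NameCount (NameCount)
        SELECT SUM(TotalNameCount)
        FROM ref.FirstName;        /* no hints: default read committed */

        SET @i += 1;
    END;

    /* Rough duration in seconds for all 200 runs */
    SELECT DATEDIFF(SECOND, @start, SYSDATETIME()) AS DurationSeconds;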

We don’t have any hints in our query

Our query for this first run is just selecting the sum from the table. We are using the default isolation level of read committed. I haven’t changed my default isolation level in this database to use any optimistic locking or anything; I’m using read committed, the default you get when you install SQL Server yourself.

This is different from Azure SQL Database: that’s the one place where it installs differently, or, I suppose, it is installed for you and uses a different setting.
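
If you want to confirm which default a given database is actually using, the optimistic locking settings are visible in sys.databases; a quick check looks something like this:

    /* Azure SQL Database turns read committed snapshot on by default;
       a self-installed SQL Server leaves both of these off unless you change them. */
    SELECT name,
           is_read_committed_snapshot_on,
           snapshot_isolation_state_desc
    FROM sys.databases
    WHERE name = DB_NAME();   -- or the database you care about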

All right, so let’s get this going.

What often happens is that people get a report that things are slow, and they open up something like the built-in Activity Monitor in SQL Server to say, “Okay, I want to see what activity is going on.” Activity Monitor isn’t perfect. It isn’t going to show us everything, and it isn’t even going to show us all types of waits: it shows us resource waits, but it won’t show us pure CPU waits. Still, looking at this, my script has already ended, but I can see that I have recently had lock waits on my SQL Server. I haven’t had any other waits, but I’ve recently had these lock waits.

If I run my script again: the first time it did the count, that sum from ref.FirstName, 200 times over, it took 18 seconds. Let’s run it again. It’ll recreate the temp table, and while it’s running, we can see that we have active lock waits on our SQL Server. I have 131 milliseconds of lock waits. This isn’t a huge amount of lock waits right now, but if I’m looking at, “Hey, I want to speed up my SQL Server,” my wait stats are indicating the number one area that I should look at.
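
If you’d rather check this without Activity Monitor, the same cumulative lock-wait information is available from the wait stats DMV; a rough sketch:

    /* A script-based look at what Activity Monitor is surfacing:
       cumulative lock waits since the instance (or the counters) last reset. */
    SELECT wait_type,
           waiting_tasks_count,
           wait_time_ms
    FROM sys.dm_os_wait_stats
    WHERE wait_type LIKE 'LCK%'
      AND wait_time_ms > 0
    ORDER BY wait_time_ms DESC;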

My reads finished again. It took 17 seconds again, and this normally takes 11 to 12 seconds if I’m not running a lot of other stuff on my laptop. A lot of my laptop’s resources are in use right now, so it’s consistently taking 17.

Let’s run it a third time and see if we can figure out what queries are involved using that Activity Monitor tool. Like I said, it’s not perfect, but it does give us some information. I can see the command with a lock wait, and specifically I get the wait type this time: it’s waiting for a shared lock. It is an INSERT command, and I can even, if I catch it while it’s happening… Let’s run this one more time. It took 16 seconds this time. If I catch it while it’s happening, while it’s being blocked… Let’s see if we can get the lock wait here.

Here’s our insert. There we go. Now we have our lock wait there. If I right-click and say ‘Details’, I can identify, “Hey, it’s this SELECT that’s having these lock waits.”
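
A script-based alternative to catching it in Activity Monitor is to query the execution DMVs while the blocking is happening; something along these lines shows who is waiting on a lock, what they’re running, and who is blocking them:

    /* Sessions currently waiting on a lock, their query text, and their blocker */
    SELECT r.session_id,
           r.blocking_session_id,
           r.wait_type,
           r.wait_time,
           t.text AS query_text
    FROM sys.dm_exec_requests AS r
    CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
    WHERE r.wait_type LIKE 'LCK%';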

And the way that people get into NOLOCK is they find things that are slow, they see that there’s someone being blocked, and they do an Internet search.

How do I get rid of blocking?

One of the things that you’ll find on the Internet is that one way to get rid of blocking, and to legitimately speed things up (if blocking is making me slow and I can get past the blocking, I’m generally going to be faster, right?), is to add a NOLOCK hint. This one is taking us 16 to 17 seconds with our current activity on the host of our virtual machine. Let’s look at the counts that we got before we add any NOLOCK hints.

We’ve got our application going kind of haywire in the background, and it’s trying to update things, right? And it’s rolling them back.

Under read committed, every time we counted, all of the 200 times, we got the same answer

Read committed will only read committed data. The data in the other tab is going haywire, and it never commits; it rolls back. It does the updates, and then it says, “Oh, I’ve got to go undo them.” It never commits, and read committed says, “Okay, if you’ve got a lock and you’re not committed, I will wait for you faithfully,” so we always got the correct answer for the committed data.

Let’s add that NOLOCK hint

We want to speed this up, so we make one change to our query: we add a table-level hint that says, “I want to use read uncommitted against ref.FirstName.” We’re still doing 200 runs. We’re still checking our time before and at the end. Everything else is the same, and we’re recreating that #NameCount table. I’m going to drop that table and run this, and let’s look at Activity Monitor while it runs.
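
In sketch form, reusing the hypothetical loop from earlier, the only statement that changes is the one doing the sum; the table hint is the whole edit:

    /* Same counting loop as before; the only change is the table-level hint.
       WITH (NOLOCK) is shorthand for WITH (READUNCOMMITTED). */
    INSERT #NameCount (NameCount)
    SELECT SUM(TotalNameCount)
    FROM ref.FirstName WITH (NOLOCK);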

Let’s look at our waits. We don’t see any lock waits at all this time, right? I don’t see any. I have backup wait as my top wait here. Lock is down here, and it’s got zeros for its wait time and recent wait time. It looks like we solved the problem, and the duration of our query was faster.

This wasn’t 16 or 17 seconds this time. This was 10 seconds.

This is how NOLOCK ends up staying in the code and starts proliferating: because I added NOLOCK and it got faster

It’s sort of like we got a reward, which makes us want to use it again and again and again. It was an easy fix. It worked immediately. But it does have some problems.

Let’s look at the data that we got this time

When I ran with read committed, all 200 times that I counted I got the same name count, which was just the committed name count.

When I used NOLOCK, otherwise known as read uncommitted, it didn’t matter that the data in the other tab never committed. It didn’t matter that there was a problem and that the data never landed in the table. Whatever state the data was in, we said, “Okay, I’ll just count it up.” So we got six different answers for how many babies are in the United States.
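
If you want to see that from the data itself, a quick check against the hypothetical #NameCount table from the earlier sketches shows how many distinct answers the 200 runs recorded:

    /* How many different "totals" did the 200 NOLOCK runs record? */
    SELECT NameCount,
           COUNT(*) AS TimesReturned
    FROM #NameCount
    GROUP BY NameCount
    ORDER BY NameCount;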

And it may be that if we look at the state of the data right now, there are fewer babies. Because this is the number of babies born, right? We might end up with a report that says a higher number of babies were born, and then someone runs it again in two minutes and the number is lower. Well, how did that happen?

The data will just look like it’s totally wrong if we are returning uncommitted data to our users and they don’t understand that we’re reading data that has never been finalized or accepted as committed inside SQL Server.