Load testing with WideWorldImporters
Imagine that I just left my desk and I went to get a delicious latte from somewhere. While I’m gone trouble strikes.
Our version of trouble today involves using some load testing tools that Microsoft gives you free with WideWorldImporters. It’s called the workload drivers.
You get it from the exact same place you download the sample database on GitHub.
I’m gonna open that up and I’ve just got it installed on my test instance’s S drive.
I’m gonna open up Multithreaded Order Insert.exe and I’m just gonna use the defaults for this.
It automatically is gonna look for a local SQL Server – that’s my little test instance– and look at the Wide World Importer’s database.
It will be barraging WideWorldImporters with inserts across ten threads, just using the default.
So let’s start that up.
Meanwhile “Frank” starts working too
And so while things get busy on that mimicking our order insert, mimicking a lot of customer activity, let’s say that Frank at our office also starts doing some work.
Frank, whose work is simulated by a SQL Server agent job that our setup script created when I ran it.
So Frank starts doing some work too. And Frank is running and the inserts are running.
Let’s pretend that we can see the instance when we – really I’m out at coffee, but if I was back in production here’s what I would see if I was looking at the instance.
I’m gonna just say, if I was there, I love sp_WhoIsActive for active troubleshooting. I’m gonna run this a few times and notice that things are changing.
At the instant I ran this now we do have blocking. Session 69 is blocking those ten import threads.
But if I keep refreshing this, it’s rotational blocking.
Rotational blocking is sneaky
A lot of blocking in the real world is like this too, and that’s why I call it the sneaky blocker. It comes and goes and somebody will be like, “Oh, performance is terrible.”
And then you look at it, you know, even if it’s just 30 seconds later sometimes it’s gone.
So you’re like, “Ah, it’s gone now.”
So refreshing this, it’s back, Frank is very persistent but it will come and go because Frank isn’t just holding a lock for a single long amount of time, ya know, he’s causing blocking and then one thing he does finishes up. This blocking is rotational.
So alright, let’s say that everything finishes up
Our inserts stop, the customer activity sort of wanes down, there was a big burst and now it’s less. And Frank finishes up too before we get back.
And we’re gonna simulate that by stopping the job that, Frank is a robot in our case so we’re, Frank’s done with his work now. And let’s say we were gettin' our coffee and got an alert on our phone that there was blocking ‘cause there is one or more sessions blocked in the SQL Server but when we get back, hey right now gee, things look pretty peaceful.
When I run sp_WhoIsActive everything’s finishing so fast I can’t even see it.
Did the blocked process report catch the sneaky blocker while it was in action?
I am going to look at the result from the blocked process report using Michael J Swart’s free Blocked Process Viewer.
This is what I mean by it making things nice to read as a big picture. I get a view of the blocking tree.
The lead blocker with session 69, it was blocking session 63 and then behind session 63 were a lot of others as well. I can look back through history and see was it always like that? At different times right, I had slightly different blocking trees.
Iin the real world this can be really useful to go through and say, okay I’ve got lock wait. Is the lead blocker always the same query or not? Sometimes it is, sometimes you’ve got multiple things that are causing blocking problems and you wanna know who blocks the most, which is the one I should focus on first, this bigger picture can help sort that out.
It is totally normal that the XML is null for the lead blocker
That looks a little weird at first. To find out about this lead blocker, we want to look at the process it is blocking or it could be blocking more than one.
In this case it’s just one. If I look at the blocked process report and that is the name of the report. The report is about the ones who are blocked, they also will contain info about who’s blocking them, you’ll see.
So we wanna look at the XML on the one who’s blocked to learn about the blocker.
So I’ll click on this. This will open up the blocked process report that the trace captured. I can see a couple different pieces of the XML here. I’ve got a whole section here for the blocked process that describes what it was.
All I get for this one is some information to decode. There’s a stored procedure in database ID6. The stored procedure has an object ID that I’m given. I also get some information about the blocking process.
This one, it was able to tell me a little more about the one who was causing the trouble: Frank’s update. I even get the information about a parameter that it was run with.
The information that I’m getting here about the TSQL is not always complete
It’s the input buffer, the same thing you would get if you ran DBCC InputBuffer from the command line, and it’s just kinda like this is the last thing I saw it push into the SQL Server.
If it’s a really long query or something like that you may only get part of it. You do get things like the SQL handle and things where you could try to research, okay what’s the full text if it’s still in memory. But ya know, I’m happy to what I get for free in the built-in stuff in here.
So that’s a bunch of useful info, though in a big one of these it can be a little bit of a bear to read.
We have information that we can decode in here
If we wanna know more about this blocking process, more about Frank, let’s go back in here. If I look over to the right and Frank’s blocking process is here, I’m gonna scroll over to the right, notice that it tells me that Frank is a SQL job.
Decoding a SQL Agent Job Name
But the SQL job name looks really funny, right? Job garbage, step one. I don’t have any jobs with this name.
If we go into my SQL Server agent, this job is actually named Frank Did It.
We have to decode a bunch of stuff in the blocked process report. So let’s copy that job name there. I’ve gotta scroll back, got a little lost there. Alright, so let’s copy that job name. If you see that ever you can decode that in the sysjobs table. It’s the job ID.
So if I select * from sysjobs with that job ID, I can see that is oh the job named SW_Frank did It.
So even things like SQL jobs you’re going to have to decode.
I can script out Frank’s update code (the block-er)
I was lucky this time, right? It did go ahead and just tell me, hey the code that was running was Frank’s update.
So I’m gonna do a quick sp_helptext and glance at Frank’s update.
Now, Frank’s code is terrible. I say this and I know it really well ‘cause I wrote Frank’s code and I worked pretty hard on this code to make it terrible. It’s got an outer, a big transaction, an outer transaction. And then in it, it declares a cursor in there and it opens its cursor and it has nested cursors. It’s declaring another cursor and then deep inside its nested cursors it’s doing an update in there and then going out and finishing up all those cursors. This code’s havin’ some problems.
So we can see we’ve got that big old transaction in there.
We want to decode the blocked process, too, right?
All I got from that was an object ID and a database ID and the information that it’s a procedure.
I can go into that database and sys.databases will tell me who’s database ID6.
It’s the WideWorldImporters database. I can see that the blocked process was the procedure Website.InsertCustomerOrders. That’s what our load test was running.
I’ve had to do some legwork but I have at least figured out who Frank was blocking and who was part of that blocking chain.
What were they blocked on, though?
So I know that that was the procedure. If we wanna decode even more we really can.
We get inside the blocked process report information about what the wait resources was. There we go, highlighting is tricky. The wait resource in this case was on a page. The blocked process was trying to get a page level lock, the engine had selected a page level lock and we get just this code for it: 6 3 36868.
I don’t know about you, but I don’t memorize all my page numbers for all my databases
Let’s go through and decode that. The wait resource can have different patterns, and I’ve got links to documentation for it if you have the other patterns, how to decode them. Sometimes you get a key lock wait. We have a page lock wait. So we’re gonna do the page one today.
Alright, so here is my page resource. This is database ID6, that’s WideWorldImporters. We could look that up in sys.databases.
It is data file number three. We have multiple data files for our database, and then this is the page number, the physical location within that file for that database where the page is that it was waiting for a lock on.
We can use an undocumented command called DBCC PAGE to see more info
What is on this page and what table does it belong to?
If that stored procedure was really, really complicated this might be one quick way to see that.
Now word of warning, DBCC PAGE is technically an undocumented command in SQL Server. It is very widely used, like I am far from the first person to show you this.
But if a problem happens when you’re using DBCC PAGE, Microsoft’s gonna say: we just don’t support that.
You can run DBCC PAGE against a restored backup of the database, though. You don’t have to run it against the production server.
So if you don’t wanna run undocumented commands against production, I support you in that, and you do not have to!
But if you wish to run this command the first thing to do is turn on trace flag 3604. That says for my session: when I run DBCC PAGE, please show me the output in my window.
Now, I can plug that page number into DBCC PAGE
I would like you, SQL Server, to give me information about the page in database ID six, data file number three, the page number that I found, and I’m gonna use what’s called dump style two.
DBCC Page can return information in a bunch of different formats. And dump style two gives me the format you see on the screen that is a pretty detailed view.
There’s a lot of info about this. If I scroll down and look in the page header section I can see information that’ll help me identify what table this belongs to like what is the object ID for this guy.
So let’s see if I can, here we go. So metadata, I’m like I know you’re in here somewhere.
Metadata, object ID, it’s this value right here, and I can also see that for that object it is index ID =1.
If you see index ID = 1, that is always the clustered index of the table.
So we had blocking on the clustered index of the table with this object ID and then we can use sys.objects to decode, okay our blocking was on the table Sales.OrderLines.
Alright, so we have set up the blocked process report to capture this information about the sneaky blocker that was happening.
Your challenge for this problem
Put into your own words a quick short recap just of what’s going on in the situation.
What are some high level suggestions that you can think of to stop the sneaky blocker from coming back?
You do not have to go through and rewrite Frank’s terrible code.
Please trust me, I know it’s terrible. Act as if this was a real world situation and you were working with Frank. Your challenge is to come up with a way to describe to Frank and others: hey here’s a big picture description of what’s going on, and here are some approaches I think we can take to change the sneaky blocker so that it is not causing so many problems.