You’ve got some troubling wait stats in SQL Server. How can you tell which queries are causing those waits?
Learn the pros and cons of different techniques to track down the cause of both common and tricky waits in SQL Server, including CXPACKET, PAGEIOLATCH, LCK, RESOURCE_SEMAPHORE, and THREADPOOL waits.
Get the scoop in this 35 minute video. Don’t have time to watch? Scroll down to read the article, or subscribe to the podcast to listen on the go (it’s on iTunes or you can plug this URL into your favorite podcast app: http://dearsqldba.libsyn.com/rss)
I recently got a great question from a reader:
There are many articles and how-to’s about SQL Server waits. Plenty of scripts to check for biggest waits and types even all waits for chosen SPID which is executing.
But how do I find the exact queries which are causing my waits?
I’m using SQL Server 2008R2.
You can see waits for all executing queries
As our questioner mentions– you can see waits for a given SPID. With free tools like Adam Machanic’s sp_WhoIsActive procedure, you can see waits for all currently executing queries– not just one SPID, but everything that’s running.
This can be useful for seeing, “Who is causing or having this wait right now,” and can sometimes be useful for home grown troubleshooting (I’ve got some examples below in the rare but dire waits section below).
Warning: There are some interesting “gaps” in looking at the waits for currently executing queries
For example, you won’t see SOS_SCHEDULER_YIELD waits in sp_WhoIsActive. That’s because it doesn’t show up in the sys.dm_os_waiting_tasks DMV, which is the go-to DMV for current resource waits. SOS_SCHEDULER_YIELD is not technically a “resource” wait, it has to do with how queries periodically yield and are queued to get back on your CPUs: the cycle of cooperation that means everything isn’t just first-in, first out.
There is a new twist to this. In SQL Server 2016, we have a new DMV called sys.dm_exec_session_wait_stats. It shows cumulative waits per session, and it does include SOS_SCHEDULER_YIELD. The information is cumulative for SPID and shows even when the SPID is sleeping, but goes away when it disconnects or is killed. This could be interesting in some systems or for some tests, but for systems where connections are used for lots of different queries, it may not be incredibly useful.
You can trace query waits with Extended Events
I once had an extremely busy system where we had growing CMEMTHREAD waits. This is an unusual wait, and our question was: is this being caused by a single query, just a few queries, or all queries? We were able to answer this by setting up an extended events trace and looking at waits by query, but this had some downsides:
- Extended Events has no GUI in 2008R2, and setting up and testing the scripts took a bit of time (more minor issue)
- Generating wait information by query on a very busy system generates a lot of output, so we had to be careful to set up sampling and filtering so we didn’t impact performance (more major issue)
But we were able to use this to figure out that the wait was associated with all queries, not a few queries, which helped us down our troubleshooting path.
Because of the overhead of collecting all of this information with a trace, though, I have only used this for looking at a specific time frame and troubleshooting a specific wait, and often sampling the trace down is required. I wouldn’t set this up as part of regular full system monitoring of common waits due to concerns about impacting performance.
You can make inferences about some common waits
For some common waits, it is easiest to make educated guesses. These methods aren’t perfect. Things like instance restarts/failovers, memory pressure and recompile hints remove information from the query cache. But they’re fast and easy, with low risk to performance. Examples:
We know those are specific to parallel queries. Parallel queries typically have a higher average CPU time than an average duration. So you can query the SQL Server dynamic management views and look for your top CPU queries, and then look at those whose CPU time is higher than their duration. They are likely your top CXPACKET generators. Here’s a sample query to find top CPU queries.
These are waits for pulling pages from storage into memory. Typically queries that do large amounts of physical reads are your big suspects for this. You can query the DMVs and look for your queries that do the highest average physical reads or writes, and start evaluating them. Here are some sample queries to find top queries for physical reads and writes.
(Note: there’s a more detailed option for finding the cause of PAGEIOLATCH waits below.)
These are lock waits associated with being blocked. For these waits, I would use something like the blocked process report if I was troubleshooting blocking and wanted a simple way to capture who is blocking who.
Right now you can enroll for free in the Troubleshooting Blocking & Deadlocks in SQL Server course to learn how to fight blocking.
You can get crafty with some (fairly) rare but dire waits
This is a wait associated with low “query workspace memory” and it causes the SQL Server to feel like it’s not working. You can still use the dedicated admin connection, but “normal” monitoring queries stop working when it gets very bad.
For this one, there’s a performance counter called Memory Manager\Memory Grants Pending. In one case where we had these waits and no full fledged monitoring system, I set up a SQL Server Agent Performance Alert based on this counter. Whenever it rose above 0, the alert triggered a job that recorded a few samples of sp_WhoIsActive and a few other queries to tables. This helped us see what queries had the wait, and what queries were running that had large memory grants– because the alert triggered the job before the waits got too bad and we could see the start of the problem. We then could troubleshoot ways to reduce the memory grants for the executing queries.
This is a wait associated with not having enough threads - it means there’s a lot of active processes, so many that the SQL Server can’t make more threads without risking stability of the OS. This can be associated with problems in connection pooling, or sometimes it happens with some applications if you get a lot of blocking – when those queries can’t complete, some applications just start taking out a lot of other connections.
Similarly to RESOURCE_SEMAPHORE, you can still use the dedicated admin connection, but normal queries “stop” working when this gets really bad.
In one case with this one, the connection pooling setup looked normal, and I highly suspected blocking was involved. So I set up a SQL Server Agent Performance Alert based on the General Statistics\Processes Blocked counter, and when that rose above 1 we captured output from sp_WhoIsActive to a table, along with capturing a few other diagnostic queries. This allowed us to confirm that blocking was the root cause of the issue and see what the blocker was doing, because again it captured activity as the waits were building up.
Knowing who is waiting is often not enough!
Let’s revisit the question:
How do I find the exact queries which are causing my waits?
One interesting point about this is that the query waiting is not always the query causing the wait.
The query waiting for a lock has the LCK_M_?? wait
But it’s just the victim! The query holding the lock who is blocking does not have the wait. (Or if the blocker does have a lock wait, we want to know who is at the root of the whole blocking chain and holding the locks that are causing it all: they aren’t waiting.)
The query waiting for query workspace memory has the RESOURCE_SEMAPHORE wait
This is another bystander/victim. What we want to know is what queries have memory grants, how many of the queries are their, how big are their memory grants, and should they be that big? They may not have any waits at all!
Even for PAGEIOLATCH_?? waits, a query having this wait might well be an innocent bystander
Perhaps the instance has recently started up, and the query wants pages that are frequently accessed, but it happens to be the first that reads them into memory, and it’s slow. Or perhaps a whole bunch of other queries took over the buffer pool for a while, and now this “normal” query needs to read frequently read pages back into memory. What we really want to know is what queries were running when our buffer pool memory started “churning”.
An easier way to find the cause of this wait might be to track the “Buffer Manager: Page Life Expectancy” performance counter, and look for times when it starts to take a sharp nose-dive. Then look at what queries were running at that point in time - particularly those that did a lot of reads. (If you aren’t tracking this counter, you can make a more general guess by looking at DMVs for your top queries doing the highest physical reads after the system has been up for a while, as mentioned above.)
These are just a few examples
For many waits, we need to know about what’s executing or has an open transaction, their session status, what resources they are using, if they have any (different) waits, and sometimes information about their execution plans. We need a lot more info than just query + wait.
Extra investment makes this easier in older versions of SQL Server
Professional monitoring tools that capture overall waits, as well as executing queries at different times and information about blocking make it easier to solve these. You still need to make some educated guesses, but you don’t have to “rig up” things to capture the information as often, because it’s regularly collected.
Looks like Query Store is going to make this much easier in future versions!
In SQL Server 2017, we have wait stats information coming into Query Store. I haven’t kicked the tires on that yet, but it looks like that is going to make answering your question much easier in future versions of SQL Server.