A query is slow, and you figure out how to collect the query execution plan. Now what?
In this video, I talk “big picture” about what execution plans are, what “cost” is, why to collect “compiled for” values, and the steps I take to analyze execution plans while performance tuning queries.
Video version (26 minutes)
Please forgive errors in grammar and punctuation: robots helped create this transcript. And I’m grammatically challenged.
Welcome to Dear SQL DBA, a podcast for SQL Server developers and database administrators. I’m Kendra Little from LittleKendra.com.
This week I’m talking about a question that I recently got from a student– but I’ve gotten it from different people by email over the years, too. We’re talking about how to analyze a SQL Server execution plan.
Here’s how the question came in this time: a student asked, “I’ve got a terribly performing query, and I’ve collected a complex execution plan for it. But I’m not sure how to interpret valuable information from the plan. Big picture: What do i do? How do I attack this thing?”
We’re gonna talk about this and break it down.
First up: what is an execution plan?
When you submit a query to SQL Server, it needs to figure out: how am I gonna run this thing? What indexes do I want to use? Can I do seeks on them do I need to scan them? What kind of joins? If you’re joining different tables together, what kind of operations do I want to use to join these things together? Do I need to build temporary objects? Do I want to do lookup operations? Should I use multiple cores to do this or should I do it single threaded?
There’s a lot of different things for it to consider and the execution plan is compiled by the SQL Server query optimizer before it ever gets started running the query. SQL Server has to use things like statistics– these little samples of data that help it guess, “hey how much data will I be dealing with from this particular table?” It also has to look at things in the query: if you have parameterized values it’ll try to sniff out, “hey if i have to compile a plan for this, what are the the values you’re using for the parameters for this stored procedure?”
Or perhaps it’s a parameterized query. Because if you’re if you’ve got those values in a predicate and you’re limiting the rows by certain filters with those values, that influences heavily how much data is coming out of different tables involved in the query.
It has to make a lot of guesses before the query ever starts executing, and SQL Server doesn’t reconsider the execution plan while the query is in mid-flight.
There is a feature that’s coming to us in SQL Server 2017 where certain query plans under certain conditions will have parts of them that are quote “adaptable”. Where on later runs of the query, SQL Server might be like, “that didn’t work out so well, so I’m gonna try a different tactic just for this part of the plan,” but that’s only in certain cases and we don’t even have SQL Server 2017 yet. So as of right now with SQL Server 2016 and prior, optimization only happens before the query starts running, and it’s all based on these estimates. SQL Server doesn’t stop and think, “did the query turn out like I estimated?”
Execution plans can be reused
SQL Server doesn’t want to compile an execution plan every time that you run a query if it doesn’t have to, so it’ll look at it and say, “hey do I have a compiled plan in cache for this query, and maybe are we running it the same stored procedure or this same parameterized query – are we running it with different values for the parameter?” In a lot of cases, it’ll say, “good thing I already compiled the plan for that, I’ve got it in memory. I can just reuse that execution plan.” So an execution plan is a way that SQL server guesses: given all my options, I think this is the best way to execute this query. It makes that guess before it ever starts running the query. Sometimes it’s using a guess that it compiled earlier, possibly for different parameter values.
When you ask, “why was the query slow?” the answer is NOT always in the execution plan
We have this additional factor in addition to everything I’ve started talking about: sometimes it query is slow for reasons that have nothing to do with the execution plan itself! Or at least that aren’t the fault of the execution plan.
Maybe my query needed a lock on a resource and something else was holding that lock, and it just had to sit and wait for the lock for a really long time. Yeah, okay, the execution plan was accessing a structure that was locked, it is *kinda* related to the plan, but the cause of this is more like: “Well why was something holding that lock for a really long time?” or “What isolation levels are we using?”
Those aren’t a problem with the plan. Is the issue that storage was really slow? Sometimes the query is slow for reasons entirely different from the plan, and you wouldn’t know that just from looking at the plan.
Things are changing fast though: as of SQL Server 2016 Service Pack 1, certain kinds of plans called actual execution plans do you have some information about wait statistics in them. It’s only that type of plan– the actual type of plan in SQL Server 2016 Service Pack 1 and later.
Most of the time when we’re looking at an execution plan we don’t have that information of did the query have to wait, and if so what was it waiting on.
We usually don’t have information in the plan about how long the query took
Even looking at the actual execution plan, only if the SQL Server Engine – not where you have management studio is, but the instance you were running the query against– only in SQL Server 2014 Service Pack 2 and higher will you have information about how much CPU was used, how what was the duration, and how many reads did something’s do. Even then it’s still only those actual plans: only one type of execution plan.
So how long was the query even slow?
So many execution plans– when you look at those most execution plans you can’t even tell how long did the whole query take, much less how long did a specific part of the query take. What all this means is that execution plans are absolutely valuable – I love working with them– and they make tuning queries much easier to me, but you do have to do other legwork to figure out things like were there waits in wait statistics? In other words, were there resources that the query didn’t have to execute, that impacted it, that we’re outside of the plan’s control.
How long did the query take in parts and in whole? Having that extra information and figuring it out is a valuable part of interpreting the plan.
I love plans I’m not putting them down, you can get tons of valuable info from them just be prepared to dig for supplementary information to help you interpret the plan, or to know if the problem can be answered in part by the plan.
When I’m interpreting plans, I think about a lot of different things and I’m going to step through today just what are the types of execution plans, what is “cost” in an execution plan and how I think of it, what “compiled for” values are and why you should collect them, and then steps to interpret the plan.
Types of execution plans
Estimated execution plans
If you’re going to run a query against SQL Server in Management Studio, you can say, give me an estimated plan for this. SQL Server will compile, “here’s what I think I would do,” without executing the query. These can be really useful because maybe your query takes 15 minutes to run you just want to get an idea of how much work do you think this is gonna be, without waiting the whole 10 minutes or using those resources.
You can also see certain things in estimated plans that show up in a different way than they do in other plans: things like functions. If you look at an estimated plan, they will actually show up in a more obvious way – they’ll be broken out in more detail than they are when you’re looking at what’s called an actual execution plan.
Sometimes you can get a really clear insight of, “oh look at all this work involved in this function” broken out in a different way in an estimated plan when you look at that in Management Studio. But an estimated plan, is just that, “Here’s what I think I would do. Here’s how many rows I think would come back from this, here’s what index I think I would use.” We don’t know how many rows are actually gonna come back from that, so we just have information about “here’s here’s the direction I think I’d head in.”
Cached execution plans
A lot of times when we’re troubleshooting a slow query we are working with an execution plan that we have pulled from the memory of the SQL Server because we’re looking for our top queries and we’re saying, “hey based on the information you have in memory, SQL Server,” maybe I’ve got a query where I’ve said I want to see the execution plans for the queries that have the longest duration. I pulled these plans out of cache.
These cached plans can be reused, and I may have overall information on, “since this plan was put in cache, how much CPU does it take on average? What duration does it take on average?” And also things like what’s the max duration it’s taken, what’s the max CPU it’s used. I have aggregate information like that, but with this cached plan I don’t know things like – okay you estimated 10 rows would come from that index, was it actually 10 or was it 10 million? The cached plan just has things like those estimated rows. It doesn’t have details on actually how many rows flowed through different parts of the plan. It doesn’t have duration information for how long different parts of the plan took.
I can kind of make some guesses based on those aggregate statistics, what I know about the tables, and some testing, but again I have to do some legwork to figure out, “hey is all of this estimated stuff in the plan, does it it does have anything to do with what actually happened when the query ran?”
(Note: I think it was Brent Ozar who I first saw categorize cached plans as a different type than estimated plans. It may make sense to break out Query Store plans as their own type as well, as there are going to be certain nuances where they’re difference from the plan in memory, especially when adaptive query plans come into play.)
Actual execution plans
These are really, really valuable, but we have to be careful generating them. One of the deals with actual execution plans is that tracing plans (any type, not just actual plans) may slow down your SQL Server a lot. Collecting them by trace is really expensive and can hurt performance.
You can get them by running a query yourself, but you have to run the query yourself and sometimes that’s not ok. Sometimes we have to restore the database someplace else to be able to rerun the query, and that can take a lot of time. Sometimes if the query is modifying a bunch of data that process of running it someplace where it’s safe to run, and then resetting the environment again can be somewhat cumbersome.
But actual execution plans can have a wealth of information, especially on SQL Server 2014 service pack 2 and higher where we can get things like not only how many rows flowed through different parts of the plan, but also information on how much CPU were we using at this point in the plan. It takes a little while to get used to interpreting this, because in certain operators that information is cumulative with its child operators, and in certain operators it’s not, but it’s a wealth of information for us. We also get those wait statistics in SQL Server 2016 and higher.
Cost is an estimate - even in “actual” plans
Depending on knowing what type of plan we’re looking at, when interpreting a plan it’s really useful – even in an actual plan – one important thing to know is that when you’re looking at cost numbers they are always an estimate.
Even though we call it an actual plan, an actual plan is an estimated plan that’s had some additional information added to it about things like, “okay what were our actual rows and what were certain actuals” But SQL Server doesn’t go back and readjust the cost. It wouldn’t make sense to in a lot of ways, as the cost information in the plan has to do with the reasoning behind why it chose these operators. If the cost was just sort of completely adjusted all the time, sometimes looking at a plan it would make no sense why it had done that.
So I actually like that cost is always an estimate, the important thing is just to remember that and don’t fall into the trap of always just looking at the “high cost operators.” I do think it’s valuable to look for the operators that have the highest cost– I will look at those in the plan. I’m gonna be like, “okay this is what SQL Server thought would be the most work.” It’s very valuable information you know about how it optimized the query, just know that it has to do just with those estimates and what it thought would be the most work and what actually took the longest in the query may or may not be those highest cost operators.
(Note: Kimberly Tripp is great at driving this point home about cost being an estimate. I always think of her when this comes up.)
There is no magic shortcut
This all means there is no single shortcut when you’re looking at a complex plan, there’s no single shortcut that just lets you know exactly where to look all the time to find the secret info.
It’s just not that easy and looking at big complex plans can absolutely be time-consuming. It’s not that you’re dumb it’s that it’s hard, and there is a lot to look at.
Collect “compiled for” values
One of the things that doesn’t always show up right in front of you in the plan that I always want to collect though is to check the properties of the plan and see what was this plan compiled for in terms of parameter values. Not all queries are parameterized, so figuring out, “okay, look at the query does it have parameters?” is part of that. If it was a parameterized query, what were the compiled for values in a cached plan.
If you’re pulling plans out of the cache, make sure that you know what those are, and you make notes– okay I’m looking at a plan that was compiled for these parameter values. One of the things that’s tricky about analyzing execution plans is: when if you rerun the query for those “compiled for” values it may be really fast. So okay, well when the query was run when it was slow, was it being executed for different parameter values?
Here’s the bad news: those aren’t in the cached plan.
It may be that the query slow because if a parameter sniffing issue, where it’s compiled with certain values and then it’s slow when it runs with different values. The execution plan will give you the compiled for values, but you have to do a legwork to figure out what other values is this often executed with, and why may it have been slow. Maybe that requires doing tracing, maybe it requires looking in your application logs. There’s lots of different ways you can research this but that’s part of the piece where like all the information we need isn’t always in the plan.
Big picture: Interpreting a plan
When I’m interpreting a plan big picture, here’s what I do. I analyze.
I step back
Sometimes literally zooming out to look at the plan and try to figure out, how is the data flowing through this plan?
I’m thinking about where I may be able to start, looking at the whole shape of the plan. Then say okay I’m gonna zoom in now on certain areas that are high cost. This is where SQL Server thought it would be the most work.
Then I’m gonna step back and start asking questions about was that really the most work. I’m not gonna assume that those were the highest amount of work, I’m gonna look at those values it was compiled for and note, okay this is was optimized for these values, so if I’m executing this for different parameter values do I get different plans, and how do they compare? Maybe something I want to look at if I suspect that plan reuse is an issue.
I’m gonna start noting likely suspects for why I think different parts of the plan might be causing the query to be slow. Suspects include large row counts. Maybe these are large row counts from a high estimate if it’s just a cached plan. Maybe it’s just a high estimated row count, I don’t know for sure, was it right? Even if it is a lot of rows I don’t even know for sure if that was slow. It’s a possible suspect.
Do I see evidence of there being spills, or do I see estimates that I think might be way off, I see anti-patterns in the query plan, where I’m like, “we’ve got a lot of functions being used here,” or “we’ve got implicit conversions that may be causing a scan,” We’ll put these in as suspects.
Then I set up a repro case and start testing things
I have to check out all of these suspects, and I have to check them out even if they’re anti patterns.
I have learned one lesson the hard way: when looking at execution plans to try to pinpoint what makes a query slow, this has happened to me so many times where, there’s something in the plan that’s just the glaring anti pattern. It’s a well-known thing you shouldn’t do. And you start testing things out, and that is NOT at all why the query is slow.
But if you if you start harping on it too much, too early, then you really have egg on your face when you test out fixing that anti pattern and it’s still slow. Or maybe removing that anti pattern makes it even slower.
Just because something is an anti pattern don’t assume that is what’s making it slow, you’ve got to test it out and check it!
Most of your ‘likely suspects’ are going to be innocent
Execution plans and code are complicated, and you’re gonna come up with a lot of candidates that might be making the query slow, and they just don’t pan out. You’ve got to just keep hunting and figuring out, hey what is the thing that’s making this slow? If it’s part of the plan itself.
Setting up that repro case and testing it is really where you start interacting with the query plan. You start testing out your hypothesis. I made this change now what happens? Does it get faster? Does it get slower?
That’s really where you start the query tuning process. You set up hypotheses using the plan and related info. The actual tuning process is when you dig in setting up a repro.
It takes takes work, sometimes you can’t just run the query against production and sometimes you aren’t lucky enough to have a realistic development or staging database to work with. Sometimes you’ve got to set up staging data yourself outside of production so that you can test one part of a process. What if my query is slow in the middle of a nightly batch process and I’ve got to get it to the point where the data is right before that process begins in order to test the slow query? It’s absolutely worth it to do it, but yeah you have to do a little bit of work if you really want to test your hypothesis sometimes.
Also, if I have a really complex query and I really want to work on part of it, I can try to break out that part of the query. Sometimes when I vastly simplify a query everything changes.
I may have to do a lot of work to reproduce the behavior that I’m seeing in the problematic execution plan to be able to test it in a smaller format.
Sometimes I also have to execute a query multiple times to get one execution plan in cache, and then reuse it with different parameter values.
But all of these steps in setting up a repro and testing it are steps where I learn more about the nature of the execution plan how the data is really flowing through there, and why things are slow. So in this process of building the repro, I’m still working on the tuning process, I’m not wasting time.
Ways to save time
If you have a monitoring tool that looks at your SQL Server and collects information like how much CPU and how many logical and physical reads do queries use, how long do they take, and what are their cached execution plans – that can be helpful because when you’re looking at a plan and you want that information of “okay, this time what parameter values was it executed with?” “How slow was it?” You have a place where you can go get those. You don’t have to setup a new trace or run a bunch of queries to figure it out.
If you have SQL Server 2016 and higher the built in Query Store feature that you can enable per database collects much of this information. A lot of it is in aggregate form. It’s not as detailed, it is not a monitoring tool, but it has aggregate information that can help you figure out hey has this query had different execution plans over time? For different intervals, what were the execution statistics for that query plan in that interval? They are an aggregate, but they persist over restarts. They’re a lot more stable than just hoping the information you want is in memory, and that can be really helpful. It’s built right into the SQL Server.
There is an art to pulling this all together
Which parts of the plan are taking the longest, what should you do to change the behavior? Right? Because even in an execution plan, even when I’ve analyzed, here here’s the part of it that’s slow, I then have to start making additional guesses about, okay well here’s the part that’s slow what do I do to make it faster?
I’ve absolutely found it to be the case where I do a lot of work I figure out exactly which part of the complex plan is slow and then I stop and I’m like uh-oh, geez, how do I fix it???
There’s no obvious option, and I’ve got to be a little creative with saying, “okay, do I need to rewrite the query? Do I need to create an index? Do I need to change an index? Do I need to use a hint?” What are the things I can do to influence SQL Server’s behavior? Then test them.
Very often the thing I want to do to change the behavior may make the query slower, so I have to make sure that my change will reliably make the query faster, even when it’s optimized for different parameter values. Narrowing as you go through this process– narrowing down which parts of this query are making it slower– you’ll learn more about the query and you’re really not wasting time.
The more you learn about the query, you’re gonna use that information when you have to start guessing about how to make it faster as well, and trying to make it faster reliably.
The more that you break down execution plans and dive in – it can be frustrating and you’re not gonna have a success every time, trust me I have not had success every time. I’ve looked at an execution plan, I need to figure out more information, I need to set up a repro, I need to get an actual plan, it will take work to do that.
The more you do this, the faster you become
You’ll develop those suspects more quickly and you’ll come up with different tests you can do to see if you can make it faster more quickly as well. I think it really becomes more fun the more you do it, too.
Thanks for joining me for Dear SQL DBA. I’m Kendra Little from LittleKendra.com