How to speed up the "Max/Min" query (16 minutes)

We can speed up this query

But I’m warning you, this is some of the ugliest TSQL you’ll see today.

Numbers tables!

In this video, I assume that you’re familiar with Numbers tables. Here’s a write up explaining the concept and benefits of a Numbers table by Aaron Bertrand: https://www.mssqltips.com/sqlservertip/4176/the-sql-server-numbers-table-explained–part-1/

Parallelism inhibitors

Read Paul White’s post which talks about forcing parallelism and parallelism inhibitors: http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx

Transcript

We’ve got a couple different options to tune our partitioned query. They have different trade-offs.

Let’s take a look at the first.

We can create a nonaligned index on our partitioned table

In other words, I can create a non-partitioned index on this guy so that the query has the chance to run just like it did before we partitioned the table. I’ve already run this command–it takes more than a minute and kicks up all the fans on my laptop. I have an index on this partitioned table. I said don’t create it on the partition scheme, and it’s not going to be partitioned by FakeBirthDateStamp into years. It’s just one simple rowstore B-tree. Now, I’m going to run the same old query against my partitioned table.

Turn on our execution plans, and wow, it’s back immediately. I did an index scan. Let’s look at the properties of this guy. Just like against the original non-partitioned rowstore index, I got a nice backward scan, and this went against the nonaligned index. So it said hey, this is one nice, beautiful, simple tree. I’ll go right to the top and just pick a row off, I’ve just got to read one row.

But of course, there’s trade-offs to having nonaligned indexes. We can no longer do partition-level operations. So if I try to do a SWITCH in, a SWITCH out, or this TRUNCATE, I’m going to be told I can’t do it because your indexes and your table don’t all use the same partition function definition. Big downside and big trade-off for nonaligned indexes. It does help in this scenario, but I really like having other options because those partition-level operations are some of the coolest things and the biggest reasons to partition your table.

Let’s take a look at a query rewrite

This is the same pattern that is recommended in that Connect bug I showed you {in the previous lesson}.

The solution relies on this little piece of T-SQL, and you’ll see that IntelliSense doesn’t really like it. It is documented, I promise, but IntelliSense kind of never got the memo about the magic partition function. I can say hey, SQL Server. Show me what partition this row is.

The partition function, in this case, I’m just doing a simple query where I’m saying okay, show me the rows where FakeBirthDateStamp is a given value. I want to see not only the columns in the table, but the partition number for the row.

So we’re going to run this query. Perhaps we’re going to run this query. There we go, I’ll highlight it.

We’ll run this query, and it shows me that the rows I’m looking at here are all in partition number two. That quick intro there just to what this function does. We can use this function as part of our workaround.

Here is a little bit of the solution. This isn’t the full answer yet. The pattern is essentially saying: take a list of partition numbers and use that to join to a query that forces SQL to optimize the query to do a little look into each partition. We essentially use a join to say I want you to optimize per partition. In this case, I’m using a value constructor to list out three numbers.

My table has way more than three partitions, but I’m just saying I want you to look at partition 41, 42, and 43. This values constructor creates a little three-row table with 41, 42, and 43 in it. I am then using CROSS APPLY to say: I want you to join to these partition numbers. Here’s the partition number function. I want you to join to every number that’s in this tiny little table. In this case, it’s just three numbers. I want you, for each of those, to figure out what the max is for that partition and then higher up in the query, I’m going to figure out the max of those three values.

Let’s run this little guy and take a look at the execution plan. I’ve got my actual plans on. Again, I’m doing MAX for this guy. Let’s run it again because I didn’t have the data in memory yet for this guy. On the second run, it’s much faster because the data is in memory for the second run. I’ve been comparing data in memory throughout all my tests, so I’m being consistent. I don’t have a giant scan jumping out at me anymore. Up here, I have a constant scan. Let’s get our Properties window open. Our constant scan, if I look at my values down here, is those scalar operators, that little values constructor I did. It’s saying I want these three rows. For each of those, it goes into a nested loop. And then let’s go down here. We do this index scan which, scroll down a little bit, we are using the object FirstNameId, the index we’ve got on — here, let’s pop that box up for the index name– We are looking at the rowstore partitioned index on FirstNameId, and we are doing a backward scan.

We’re going into, in this case, three different partitions. In our CROSS APPLY join, we’re saying: I want to look at the partitions with these numbers. We’re accessing partitions 41 to 43, and our number of rows read is three because we’re finding the max from each partition. We’re pulling it back and then doing the overall max. Well, this is a good pattern, but I’ve got more than three partitions in my table. We can adapt this pattern in a couple of different ways. I could of course do a values constructor like this that has hard-coded all of the numbers of my partitions.

But what if I’m regularly switching in new partitions?

What if my table is growing, and it doesn’t have a set number of partitions? Sometimes I always have a certain number of partitions, and whenever I switch in, I switch some out. I don’t always have my table grow, but it’s certainly valid for it to do so.

Well, we can dynamically figure out how many partitions we have. The sys.partition_functions Dynamic Management View, if I look at, this is my partition function that I’m using. I can query that and join to another table to get one row per partition.

Here I’ve just built a numbers table that has a list of numbers in it. I can use the fanout column in sys.partition_functions and join to a numbers table to get a list, a set, a table set where I have one row for each of the numbers of my partitions. And I could join to this in a query.

One important thing to know is that when we join to a system table in a query, it prevents parallelism, and I learned that from Paul White from New Zealand. He has an excellent blog on SQL Server, and one of the things he mentioned is that joining to a system table will make your query single-threaded. If I was going to take this technique, I would want to do it in two steps where I put my fanout number from sys.partition_functions, I put that into a variable, and then I join to that lower in the query.

Ref.Numbers is not a system table. I just made that table.

But I can put this together with that T-SQL technique we have where I am building a table with one row per partition with the partition number in it. I’m using CROSS APPLY to say I want the MAX for every partition in the table by joining back to that list of partitions. This is a little bit more complicated than the query I started out with, right? ;) The T-SQL is perhaps not as easy to read, you might say. Let’s turn on our actual execution plans, and we’ll click Execute, and we did get good performance. Looking at our query time here, highlighting our SELECT operator, I have an elapsed time of 22 milliseconds on the second query, and that first query of the system table was super duper fast. I’m at 22 milliseconds.

I’m pretty happy with that. Looking down, we have that index scan operator.

It did look at all 56 partitions. We wanted it to because we’re saying go to every single one, and check, and find the max. It’s smart enough now that we rewrote the query to say, oh, I can just do a little backward scan of that. We feed all of those rows, and only 50 rows came back. I have some empty partitions. It did run it 56 times, but it didn’t find a row every time.

Those go back up, and it finds the max overall. I don’t know that I would call this an elegant solution, but it is the recommended workaround for getting good performance again out of that MAX operator, and this can also happen with TOP 1 queries if we rewrote it that way, or with a MIN query. If you don’t want to do this rewrite, I encourage you to vote up that bug on the Connect site.

There is another option for fixing it.

We can use the powers of our nonclustered columnstore index

Now, up to this point, notice I didn’t have to in this rewrite query, I didn’t have to tell it not to use the columnstore! With my original query, as soon as I started running against the partitioned table, I was using that hint that said ignore the nonclustered columnstore index. Let’s see how it does. I’m going to run against that index, and I don’t even know that I had the data in memory, so I’m going to run it again just to be fair.

It’s pretty darn fast, as you may have noticed. Here is my plan which did a columnstore index scan. Let’s right-click and look at the properties of that scan. This looked at our columnstore index with batch mode, and there’s something in this plan that is specific to Developer/Enterprise Edition. It shows me that it was able to do what’s called local aggregated rows. It was able to push down some of the aggregation operations deep into the process where it’s interacting with the columnstore index. That’s something that we’re not going to get in Standard Edition. This query is still really fast in Standard Edition against the nonclustered index.

Even with just the two cores that I’m allowed for batch mode in Standard Edition, it’s zooming fast against the nonclustered columnstore. My query time stats for this guy, my elapsed time here is 602 milliseconds. It’s not zero milliseconds, and I did burn some CPU, and I am, you know, I could have locking and blocking that I don’t have against the very simple rowstore instance, depending on what modifications are happening. But this columnstore index option is pretty darn fast, and it’s a great option to have, especially if you’re going to be using a columnstore index for other aggregate queries against the table. This is how I think of that ride along the columnstore index.

SQL Server can be really fast at scanning that thing, and that cool ability for Enterprise Edition to push down some of the aggregations into the columnstore index operation itself is really cool.

Here’s a picture of what the columnstore index scan looks like against my Standard Edition instance. Only got two threads doing that columnstore index scan, and I didn’t get those localized push down. But in this case, those are still pretty low amounts of milliseconds of time. I still can get way faster performance than an 11 second scan of every partition against a rowstore index.

Still not a bad deal for this guy in Standard Edition!

When it comes to these MAX/MIN problem queries, I like the option of rewriting the query if I still have that rowstore index against the partitioned table.

If I’m going for a columnstore index, I might not bother with that rewrite though, and I might just let it blast along the columnstore index as my fix. I prefer both of those options to creating a nonaligned nonclustered index. It might work in some cases, if I don’t need to do partition switching, but I would far rather have a fast query while still being able to have aligned indexes on my partitioned table.