Filling in Data Potholes with Recursive CTEs

Data: it can break your foot.

Imagine that you are writing a script that looks at data grouped by the minute. You notice that there are no rows for some minutes, and you’d like to display a value when that is the case, probably showing a count of zero.

In thinking about this problem this week, I spent some time getting to know CTEs (Common Table Expressions) again. And I came to the conclusion that I should spend much more time with them. Maybe I won’t end up using them all the time, but I should be looking at them regularly as options when I’m writing queries.

Here’s the story of a handy way I found to work with this.

Let’s create some data

Our story starts with some data. It’s been lovingly scripted out, but it has a few holes.

CREATE TABLE dbo.MyImperfectData (
ItemDate DATETIME2(0) ,
ItemCount SMALLINT )
GO

INSERT  dbo.MyImperfectData ( ItemDate, ItemCount )
VALUES  ( '2010-12-01 00:00:00', 12 ),
( '2010-12-01 00:01:00', 3 ),
( '2010-12-01 00:02:00', 6 ),
( '2010-12-01 00:03:00', 12 ),
( '2010-12-01 00:04:00', 24 ),
( '2010-12-01 00:05:00', 1 ),
-- Gap where 6 would be
( '2010-12-01 00:07:00', 122 ),
( '2010-12-01 00:08:00', 1 ),
( '2010-12-01 00:09:00', 1244 ),
( '2010-12-01 00:10:00', 23 ),
( '2010-12-01 00:11:00', 12 ),
( '2010-12-01 00:12:00', 24 ),
( '2010-12-01 00:13:00', 27 ),
( '2010-12-01 00:14:00', 28 ),
--Gap where 15, 16, 17 would be
( '2010-12-01 00:18:00', 34 ),
( '2010-12-01 00:19:00', 93 ),
( '2010-12-01 00:20:00', 33 ),
( '2010-12-01 00:21:00', 65 ),
( '2010-12-01 00:22:00', 7 ),
( '2010-12-01 00:23:00', 5 ),
--Gap where 24 would be
( '2010-12-01 00:25:00', 4 ),
( '2010-12-01 00:26:00', 6 ),
( '2010-12-01 00:27:00', 7 ),
( '2010-12-01 00:28:00', 77 ),
( '2010-12-01 00:29:00', 94 )

CREATE UNIQUE CLUSTERED INDEX cxMyCTE ON dbo.MyImperfectData(ItemDate)

The data is at the minute level. We’re missing data for five minutes in this period– one three minute chunk, and two other minutes.

What’s the quickest way to show the missing rows?

At first I thought about querying the data itself to find what’s missing. This made my head hurt a bit, and seemed pretty expensive.

I thought about the fact that many data warehouse databases have calendar tables, where all sorts of information about months, days, years, hours, and minutes are normalized out into tables.

However, I didn’t have those types of tables around. For the scope of my problem I was dealing with short date ranges (and by short, I mean 3 hours) , and ideally I would not need to create a bunch of ancillary objects to fill in the gaps.

After some thinking, I realized that we can create a date time table at the minute level on the fly by using a recursive CTE.

Here’s a sample that counts out a few minutes:

WITH    MyCTE
AS ( SELECT   CAST('2010-12-01 00:00:00' AS DATETIME2(0)) AS [I can count!]
UNION ALL
SELECT   DATEADD(mi, 1, [I can count!])
FROM     MyCTE
WHERE    [I can count!] < DATEADD(mi, -1,
CAST('2010-12-01 00:10:00' AS DATETIME2(0))) )
SELECT  [I can count!]
FROM    MyCTE
OPTION  ( MAXRECURSION 0 ) ;

Our results:

Putting it all together

Taking the format of this CTE, we can change it to create a table with every minute in our time range.

We can then select from it and use a LEFT OUTER JOIN to our table with data, and use the CTE dates to fill in the gaps.

DECLARE @startDate DATETIME2(0) ,
@endDate DATETIME2(0) ;

SELECT  @startdate = MIN(ItemDate), @endDate = MAX(ItemDate)
FROM    dbo.MyImperfectData ;

WITH    MyCTE
AS ( SELECT   @startDate AS MyCTEDate
UNION ALL
SELECT   DATEADD(mi, 1, MyCTEDate)
FROM     MyCTE
WHERE    MyCTEDate < DATEADD(mi, -1, @endDate) )
SELECT  MyCTEDate, CASE WHEN Itemcount IS NULL THEN '[Missing Row]'
ELSE ''
END AS ColumnDescription,
COALESCE(ItemCount, 0) AS ItemCount
FROM    MyCTE
LEFT OUTER JOIN dbo.MyImperfectData ld
ON MyCTE.MyCTEDate = ld.ItemDate
ORDER BY MyCTEDate
OPTION  ( MAXRECURSION 0 ) ;

And there we have it! No gaps:

No gaps allowed.

Use Cases

Check out the comments! In my initial posting, I didn’t say enough about where this is best applied, and how this scales.

I think this is mostly a party trick, but it’s also a nice simple example of recursion that got me thinking about CTEs. ¬†And while there are some situations where it can come in useful, it doesn’t scale up to large date ranges. (Check out Brad Schulz’ post on recursive CTEs here.)

So in other words, this may be helpful in some ad-hoc situations.

However, looking at the “pseudo-recursive” parts of Brad’s post, I really feel a follow-up post or two coming on.

Tags:

About Kendra Little

Kendra specializes in high availability and performance tuning. She is a Microsoft Certified Master in SQL Server-- the highest technical SQL Server Certification available. Kendra loves databases and software development more than long walks on the beach. Those cartoons in her blog posts? She draws 'em all. You should follow Kendra on Twitter: http://twitter.com/kendra_little

6 Responses to “Filling in Data Potholes with Recursive CTEs”

  1. Cade Roux December 23, 2010 at 8:32 am #

    Love CTEs – you can even put this in an inline table-valued function (have to use the OPTION clause in the outer call).

    There are sometimes performance issues, of course.

  2. Brad Schulz December 23, 2010 at 12:40 pm #

    Hi Kendra…

    Recursive CTE’s are really cool, but they produce tons of logical reads, because spools are used to store/re-read the data.

    For an example that you gave, it is fine, because it only covers a time span of 28 minutes (and therefore 29 rows were generated by the CTE). But if you change your script to have an expanse of a year (i.e. set @enddate to DATEADD(year,1,@startdate), generating over 500,000 rows, then the reads would skyrocket and performance would suffer.

    If you need more than a handful of rows, then using a Numbers/Tally table would be a more efficient approach to generate the individual times you need rather than the Recursive CTE. In a quick test I put together for a year-long span, the Recursive CTE generated 5.8million logical reads and used 7238ms of CPU, while a Numbers approach took 1million reads and only 1014ms of CPU.

    You can read more about Recursive CTE’s in my (admittedly) weird post from earlier this year:

    http://bradsruminations.blogspot.com/2010/03/this-article-on-recurson-is-entitled.html

    Best…

    –Brad

    • Kendra Little December 23, 2010 at 1:23 pm #

      Thanks for the comment, and for the link!

      I think I’ll edit the post to be more clear about use cases– in the case in question, I was dealing with a short time period– less than three hours, and I needed a quick fix.

      I *do* agree that numbers/ tally tables are great to have– makes for simpler code, too.

      I was thinking this would be most useful to users who need to query data where numbers/tally objects don’t exist, and they either don’t have access to create permanent objects or they need to do a short one-time adhoc script.

      I may see if I can do a follow up post with execution plans, for fun.

      • Brad Schulz December 23, 2010 at 1:33 pm #

        Agree completely… Happy Holidays!

Trackbacks/Pingbacks

  1. Filling in Data Potholes Redux: Tally Tables vs CTEs | LittleKendra.com - January 4, 2011

    […] … our heroine (that’s me) rediscovered CTEs, specifically in the recursive style. That was in my post “Filling in Data Potholes with Recursive CTEs.” […]

  2. Filling in Data Potholes Redux: Tally Tables vs CTEs | SQLServerPedia - January 4, 2011

    […] … our heroine (that’s me) rediscovered CTEs, specifically in the recursive style. That was in my post “Filling in Data Potholes with Recursive CTEs.” […]

Leave a Reply