Archive | Sample Code RSS feed for this section

There’s more than one way to skin an eggplant: Using APPLY for calculations

Choices, choices

Here’s a little TSQL snack. I picked this up in a presentation by Itzik Ben-Gan at the PNWSQL user group recently, and it’s become a fast favorite.

CROSS APPLY and OUTER APPLY- another use

The APPLY operator is perhaps more flexible than  you think. You may already know that you can use it to inline a function, or to replace a join.

But wait, there’s more! You can also use APPLY to perform calculations and simplify your query syntax– this is because the APPLY operator allows you to  express a calculation that can be referred to:

  • in further joins (which may or may not use APPLY)
  • by columns
  • in the where clause
  • in the group by

This is really helpful, because you can’t refer to the results of a computation in one column from anywhere but the ORDER BY.  This is because of the order of evaluation of parts of the statement.

I know this sounds confusing. It’ll make more sense in an example.

A sample query– the ‘before’ version

Here is a query written for the AdventureWorks sample database. There’s all sorts of examples that are possible for this, but I decided to go with one grouping data by month, using my favorite formula to round dates.

It shows the total quantity of orders by Product for an entire order month, for orders placed on or after 2004-07-01.

SELECT  DATEADD(MM, DATEDIFF(MM, 0, oh.OrderDate), 0) AS OrderDateMonth,
        p.Name AS ProductName,
        SUM(orderQty) AS totalQuantity
FROM    sales.SalesOrderHeader oh
JOIN    Sales.SalesOrderDetail od
        ON oh.SalesOrderID = od.SalesOrderID
JOIN    production.Product p
        ON od.ProductID = p.ProductID
WHERE   oh.OrderDate >= '2004-07-01'
GROUP BY DATEADD(MM, DATEDIFF(MM, 0, oh.OrderDate), 0),
        p.Name
ORDER BY OrderDateMonth,
        p.Name

Notice that to group the date at the month level, we need to include the calculation in the column in the column list, as well as in the group by clause.

The query rewritten using APPLY for the calculation

This can be rewritten with CROSS apply to move the calculation into the JOIN area and only specify it once.

The benefits: this will simplify your syntax and reduce the chance of typos and errors, particularly when you need to go in and change the calculation. In cases when you’re displaying a sum in one column and showing a percentage using it in another column, this trick is *fantastic*. (Query numbers from the DMVs a lot? you’ll love this.)

Here, the calculation on the date is moved into the cross apply. It can be referenced as oh1.OrderDateMonth in both the list of columns, and in the GROUP BY portion of the query without rewriting the calculation.

SELECT  oh1.OrderDateMonth,
        p.Name AS ProductName,
        SUM(orderQty) AS totalQuantity
FROM    sales.SalesOrderHeader oh
CROSS APPLY ( SELECT    DATEADD(MM, DATEDIFF(MM, 0, oh.OrderDate), 0) AS OrderDateMonth ) AS oh1
JOIN    Sales.SalesOrderDetail od
        ON oh.SalesOrderID = od.SalesOrderID
JOIN    production.Product p
        ON od.ProductID = p.ProductID
WHERE   oh.OrderDate >= '2004-07-01'
GROUP BY oh1.OrderDateMonth,
        p.Name
ORDER BY OrderDateMonth,
        p.Name

What does the execution plan look like?

Click for a larger image

The execution plan for these two queries are identical.

In this case, the optimizer looks at these two queries and realizes the activities it needs to do will be the same.

Other options

You can create further CROSS APPLY or OUTER APPLY joins that refer to computations in prior joins.

You can also refer to the resulting computation in the where clause.

But be careful….

As with anything, you want to make sure you’re getting a good execution plan, and not shooting yourself in the foot with a new trick.

One big area to watch: although you can refer to these computations conveniently in the WHERE clause, you still want to be careful you’re using appropriate criteria.

For instance, if we were to change the example above to refer to the result from the CROSS APPLY oh1 in the where clause like this:

SELECT  oh1.OrderDateMonth ,
        p.Name AS ProductName ,
        SUM(orderQty) AS totalQuantity
FROM    sales.SalesOrderHeader oh
CROSS APPLY ( SELECT    DATEADD(MM, DATEDIFF(MM, 0, oh.OrderDate), 0) AS OrderDateMonth ) AS oh1
JOIN    Sales.SalesOrderDetail od
        ON oh.SalesOrderID = od.SalesOrderID
JOIN    production.Product p
        ON od.ProductID = p.ProductID
WHERE   oh1.OrderDateMonth >= '2004-07-01'  ---Don't do this!
GROUP BY oh1.OrderDateMonth ,
        p.Name
ORDER BY OrderDateMonth ,
        p.Name

… then in this case the query would not be able to use an index on OrderDate on the sales.SalesOrderHeader table, if one exists.

This is not specifically because of the CROSS APPLY, but because we are forcing SQL Server to apply the functions to every value to identify if it satisfies the criteria. That prevents a seek.

Comments { 2 }

Dynamic Management Quickie: Exploring SQL Server’s system views and functions as you work

Note: If you like this post, here’s one you’re going to like even better! Check out DMV/DMF Info Just A Couple Clicks Away by Brad Schultz.

There’s a lot of dynamic management and system objects to keep track of in SQL Server.

We all sometimes have the moment when we can’t remember exactly which DMV, DMF or other system view/function returns a particular column, or if something even IS accessible from the system objects.

When this happens, remember that it’s easy to query system object and column names. Sys.system_columns and sys.system_objects are here to help.

Exploring the system views and functions yourself will also help you find new things.

This example shows all the system views and functions which are likely to have to do with CPU:

SELECT
SCHEMA_NAME(SCHEMA_ID)+ '.' + o.Name AS SysObjName,
o.type_desc,
c.name AS ColumnName
FROM sys.system_columns c
INNER JOIN sys.system_objects o ON o.object_id=c.object_id
WHERE c.name LIKE '%cpu%'

I like to use this version of the query, which includes the URL to look up more about the DMV. I like to use the browser in SSMS itself to look these up, so I include the shortcut for that in the header.

SELECT
	SCHEMA_NAME(SCHEMA_ID)+ '.' + o.Name AS SysObjName,
	o.type_desc,
	c.name AS ColumnName,
	'http://social.msdn.microsoft.com/Search/en-US/?Refinement=117&Query=' + SCHEMA_NAME(SCHEMA_ID)+ '.' + o.Name AS [Help! Ctrl + ALT + R to open web browser in SSMS]
FROM sys.system_columns c
INNER JOIN sys.system_objects o ON o.object_id=c.object_id
WHERE c.name LIKE '%page%'

The output looks like this (click for a larger view):

I’ve started using this lately since it helps me explore as well as remember.

Just when you think you know everything about the system objects, you’ll find something new.

Fun example: look at all the columns like ‘%page%’.

Comments { 8 }

How To: Automate Continuous SQLServer Activity with Stored Procedures and Powershell Jobs

The Goal

It’s often useful to be able to run a bunch of stored procedures in the background over a period of time against a test instance.

This can be nice for:

  • Demos and presentations.
  • Populating DMVs with data you can slice and dice.
  • Learning to use things like extended events and server side trace (which are much more interesting with something to look at).
  • Testing a variety of automation scripts.

This post shows you how to create several stored procedures for AdventureWorks2008R2 which will provide different result sets and have slightly different run times when run with a variety of parameters– in this case, individual letters of the alphabet.

You can then run PowerShell commands which start jobs in the background. Each job runs a stored procedure and loops through all letters of the alphabet, providing each one as a parameter. You can set the job to do that loop a configurable amount of times (the commands are set to 100). In other words, as given, each stored procedure will be run 2600 times. Since you’re running multiple jobs and they’re all going asynchronously in their own threads, you’ll have a variety of commands trying to run at the same time.

Optional: you can start the PowerShell jobs under different credentials if you need.

Alternatives: In the past, I’ve typically done things like this with T-SQL loops (often with dynamic SQL) and multiple Management Studio windows. This works OK, but it’s a little time consuming to open each window, paste everything in (or open multiple files), and start them all up. I find it much more convenient now to use scripts.

Step 1: Create Stored Procedures with a single alphabet-based parameter

Let’s get one thing clear: these procedures aren’t designed to run optimally, and they aren’t coded nicely.

You’ll notice these procedures have all sorts of problems. And that’s by design– my goals are for testing things around these, so it’s really a little better for me if they don’t play perfectly nice.

In other words, these sure ain’t for production. :)

/****************
Jump in the kiddie pool
********************/
USE AdventureWorks2008R2;
go

/****************
CREATE THE SCHEMA
********************/
IF SCHEMA_ID(N'test')  IS NULL
	EXEC sp_executesql N'CREATE SCHEMA test AUTHORIZATION dbo'
GO

/****************
CREATE Silly Stored Procedures in the Schema
********************/
IF OBJECT_ID(N'test.EmployeeByLastName', 'P') IS NULL
	EXEC sp_executesql N'CREATE PROCEDURE test.EmployeeByLastName as return 0'
GO
ALTER PROCEDURE test.EmployeeByLastName
	@lName nvarchar(255)
AS
	SELECT @lName = N'%' + @lName + N'%'

	select *
	FROM HumanResources.vEmployee
	WHERE LastName LIKE @lName
GO

IF OBJECT_ID(N'test.EmployeeByFirstName', 'P') IS NULL
	EXEC sp_executesql N'CREATE PROCEDURE test.EmployeeByFirstName as return 0'
GO
ALTER PROCEDURE test.EmployeeByFirstName
	@fName nvarchar(255)
AS
	SELECT @fName = '%' + @fName + '%'

	select *
	FROM HumanResources.vEmployee
	WHERE FirstName LIKE @fName
GO

IF OBJECT_ID(N'test.EmployeeDepartmentHistoryByLastName', 'P') IS NULL
	EXEC sp_executesql N'CREATE PROCEDURE test.EmployeeDepartmentHistoryByLastName as return 0'
GO
ALTER PROCEDURE test.EmployeeDepartmentHistoryByLastName
	@lName nvarchar(255)
AS
	SELECT @lName = N'%' + @lName + N'%'

	select *
	FROM HumanResources.vEmployeeDepartmentHistory
	WHERE LastName LIKE @lName
GO

IF OBJECT_ID(N'test.EmployeeDepartmentHistoryByFirstName', 'P') IS NULL
	EXEC sp_executesql N'CREATE PROCEDURE test.EmployeeDepartmentHistoryByFirstName as return 0'
GO
ALTER PROCEDURE test.EmployeeDepartmentHistoryByFirstName
	@fName nvarchar(255)
AS
	SELECT @fName = '%' + @fName + '%'

	select *
	FROM HumanResources.vEmployeeDepartmentHistory
	WHERE FirstName LIKE @fName
GO

IF OBJECT_ID(N'test.ProductAndDescriptionByKeyword', 'P') IS NULL
	EXEC sp_executesql N'CREATE PROCEDURE test.ProductAndDescriptionByKeyword as return 0'
GO
ALTER PROCEDURE test.ProductAndDescriptionByKeyword
	@keyword nvarchar(255)
AS
	SELECT @keyword = '%' + @keyword + '%'

	select *
	FROM Production.vProductAndDescription
	WHERE Name LIKE @keyword OR ProductModel like @keyword OR description LIKE @keyword
GO

Once you’ve got the procedures written, you just need to set up your PowerShell commands.

Step 2: Create PowerShell Jobs to Run the Procedures in Loops

These commands use PowerShell background jobs.

Even if you don’t know PowerShell, if you look at these commands you can pretty easily pick out where the 1 to 100 loop is, where the a to z loop is, and what commands are being run.

Since the jobs are running to create load in the background and I don’t care about collecting query results, I pipe the output all to Out-Null.

#test.EmployeeByLastName
Start-Job -ScriptBlock {Import-Module sqlps; foreach($_ in 1..100) {foreach ($_ in [char]"a"..[char]"z") {Invoke-Sqlcmd -Query "exec test.EmployeeByLastName '$([char]$_)'" -ServerInstance "YOURMACHINE\YOURINSTANCE" -Database AdventureWorks2008R2 | Out-Null }}}

#"test.EmployeeByFirstName"
Start-Job -ScriptBlock {Import-Module sqlps; foreach($_ in 1..100) {foreach ($_ in [char]"a"..[char]"z") {Invoke-Sqlcmd -Query "exec test.EmployeeByFirstName '$([char]$_)'" -ServerInstance "YOURMACHINE\YOURINSTANCE" -Database AdventureWorks2008R2 | Out-Null }}}

#"test.EmployeeDepartmentHistoryByFirstName"
Start-Job -ScriptBlock {Import-Module sqlps; foreach($_ in 1..100) {foreach ($_ in [char]"a"..[char]"z") {Invoke-Sqlcmd -Query "exec test.EmployeeDepartmentHistoryByFirstName '$([char]$_)'" -ServerInstance "YOURMACHINE\YOURINSTANCE" -Database AdventureWorks2008R2 | Out-Null }}}

#"test.EmployeeDepartmentHistoryByLastName"
Start-Job -ScriptBlock {Import-Module sqlps; foreach($_ in 1..100) {foreach ($_ in [char]"a"..[char]"z") {Invoke-Sqlcmd -Query "exec test.EmployeeDepartmentHistoryByLastName '$([char]$_)'" -ServerInstance "YOURMACHINE\YOURINSTANCE" -Database AdventureWorks2008R2 | Out-Null }}}

#"test.ProductAndDescriptionByKeyword"
Start-Job -ScriptBlock {Import-Module sqlps; foreach($_ in 1..100) {foreach ($_ in [char]"a"..[char]"z") {Invoke-Sqlcmd -Query "exec test.ProductAndDescriptionByKeyword '$([char]$_)'" -ServerInstance "YOURMACHINE\YOURINSTANCE" -Database AdventureWorks2008R2 | Out-Null }}}

Each command will start an asynchronous background job.

Step 3: Manage Jobs (if needed)

Once the jobs are running in the background, you may want to check on their status. You can do so by running:

get-job

if you want to remove a job from the list, you can use Remove-Job with the job number, or you can remove all jobs (whether or not they are running) with:

Remove-Job * -Force

If you want to see the output of a job, you can use Receive-Job– supply the jobnumber. If you’re troubleshooting and want to see errors, you probably want to remove | Out-Null from the command that starts the job, and use a fewer number of loops. Then you can can receive the job’s output and see any errors.

Receive-Job JOBNUMBER
Comments { 6 }

Dirty Pages and Statistics IO

Warning: The DROPCLEANBUFFERS command referenced in this post is appropriate for test systems only and impacts the entire SQL Instance. If you are new to SQL Server, please use this commands with care, and be careful to read the linked Books Online documentation. Happy testing!

You were hoping for a picture, right?

The other day I was running some test queries and looking at the number of reads, and I noticed something funny.

I was dropping clean buffers prior to running a query, but I would sometimes see that there had been no physical reads.

No physical reads? Where was the data coming from?

I was working on a small number of rows, but it still bothered me.

The output looked like this:

The Set-Up

Here’s a simple simulation of what I was doing. First, create a database and insert some values.

SET NOCOUNT ON;
SET STATISTICS IO OFF;
create database dirtyBuffers
GO
USE dirtyBuffers
GO
--Create a table and insert some values
create table dbo.testme (
	i int identity,
	j char(2000) default 'baroo'
)
GO
insert dbo.testme default values
GO 20

Then, turn on Statistics IO so we can see read information. Drop clean buffers, so data isn’t in memory. Then run a query.

SET STATISTICS IO ON;

DBCC DROPCLEANBUFFERS
GO

--Select some rows
select * from dbo.testme

It should read it from disk, right?

 

What I Forgot

I was forgetting about dirty pages. In  order to get a “cold cache”, you need to first run a CHECKPOINT command to flush dirty pages to disk, then run DBCC DROPCLEANBUFFERS to remove everything from the bufferpool. This is very well documented in Books Online.

This was easy to forget because typically I test execution of queries against a restored copy of a production database, or a dataset which isn’t changing.

What I Hadn’t Realized

I don’t think I ever specifically realized that dirty pages could be immediately re-used for query results– but it makes perfect sense. I had only thought about clean pages, which were read in for one query, to be available for re-use.

I felt a little silly when I realized this. Shouldn’t I have known this? But after thinking about it I realized: there’s little gaps like this in most everyone’s knowledge. Sometimes it takes a little bit of extra experience to notice the gap and fill it in. It happens to us all.

After rerunning the commands and including a CHECKPOINT with DBCC DROPCLEANBUFFERS, I see the expected output– a physical read.

Comments { 3 }
Corrupting databases is a lot like eating paste.

More Running with Scissors: Corrupting your Database with 823 Errors

Corrupting databases is a lot like eating paste. Delicious, delicious paste.

This weekend, a question came up on Twitter asking if there was an easy way to simulate an 823 error. It seemed like a fun task to figure out.

In a  previous post, I showed  how to corrupt your database with a Hex editor to cause 824 errors.

What’s an 823 Error?

An 823 error in SQL Server is a severe error that occurs when accessing a database file. It’s described in detail in KB2015755.

This is Useful!

This is useful for learning about corruption, and practicing responses to corruption events.
You can also use this to test configuration scripts you have for database mail, operators, and alerts, to make sure the alerts are working properly.

Disclaimer: these scripts are for test environments only/as always use these at your own risk and be careful not to eat paste.

Step 1: Create a share in Windows on your test machine

First, you need to create a shared folder in Windows.
In order for later steps to work, grant your SQL Server instance’s service account full control over the share.

Step 2: Map a the share to a network drive from the SQL Instance

The ‘Net use’ command is specific to a profile, so the easiest way to handle this is to enable XP_CMDSHELL and map the drive from SQL Server itself.

--Oh no, security nightmare!
sp_configure 'xp_cmdshell', 1
RECONFIGURE

EXEC xp_cmdshell 'net use "Y:" "\\MYMACHINE\z_testnetworkDrive" /PERSISTENT:NO'

Step 3: Create a test database with a filegroup, file, and table  on the network drive

Now we just need to create a database, and create a table on the drive.

--Create test database and add a filegroup and table on the network share
IF  db_id('TestMe') IS NOT NULL
BEGIN
	USE master;
	ALTER DATABASE TestMe SET SINGLE_USER WITH ROLLBACK IMMEDIATE
	DROP DATABASE TestMe
END

CREATE DATABASE TestMe
go

ALTER DATABASE TestMe ADD FILEGROUP FG1
ALTER DATABASE TestMe ADD FILE (NAME=f1, FILENAME='Y:\f1.ndf', SIZE=128MB)TO FILEGROUP FG1 

USE TestMe
go
CREATE TABLE t1 (
	i INT IDENTITY,
	j CHAR(200) DEFAULT 'x'
) ON FG1
go
INSERT t1 DEFAULT VALUES
GO 20

--Flush everything to disk
CHECKPOINT
GO

Step 4: Start a Loop of Reads in Another Connection

Now we want to simulate reads. Open up a new connection against your instance, and run the following commands to repeatedly read data from the t1 table. We use DROPCLEANBUFFERS to make sure we’re reading from disk each time. (We already ran a checkpoint to flush the writes to disk.)

--Run this in another connection
SET NOCOUNT ON;
GO
BEGIN
	DBCC DROPCLEANBUFFERS

	SELECT * FROM t1
END
GO 50000

Step 5: Disconnect the network drive, and voila! 823 Error.

Now, back in your first connection, disconnect the network drive with the following command:

EXEC xp_cmdshell 'net use "Y:" /DELETE /Y'

Your connection which is running reads should fail with an error like this:

The connection will automatically be terminated when the error occurs.

Don’t forget to disable XP_CMDSHELL

Like so:

sp_configure 'xp_cmdshell', 0
RECONFIGURE

That’s better.

Activity: Recover from the corruption, without bringing the network drive back online

To fully do the activity, you’ll want to add some database backups in before the “corruption” event of disconnecting the network drive. You may want to combine full, differential, and/or log backups, and change data in the table at various points between (and after) backups.

Then, practice bringing things back online. How much data will be lost in each scenario? How quickly can you bring the database online?

Another Solution- The USB Drive

You can also do this by creating the filegroup, file, and table on a USB stick, and removing the USB stick instead of unmapping the network drive.

However, I preferred this example since it’s easy to re-run from management studio itself, and no additional physical devices are required.

Comments { 2 }

Be Prepared: Collecting Data from sp_WhoIsActive in a Table

I am a fangirl of Adam Machanic‘s sp_whoisactive stored procedure.

Be prepared. When things go wrong, you want to focus on the problem itself. Make a plan for how you

One of the great things about sp_whoisactive is that it’s very quick to generate a table to collect the data you need, then run the procedure in a loop with a wait statement to regularly collect the data you want over an interval. This usage is documented (see Adam’s post here), but I’ve chatted with several people who either didn’t know you could do this, or didn’t have a script saved.

What’s sp_whoisactive?

Sp_whoisactive is a procedure written by Adam Machanic. It can be found here: sp_whoisactive downloads

What to Do Today

Do your future self a favor: Today, set yourself up the following:

  • Create the sp_whoisactive stored procedure in a user database you use only for administration
  • Play with the script I provide below and look at the  parameters of the stored procedure
  • Save a script which creates a table for your preferred version of sp_whoisactive, then runs in a loop to store the data.

Why Do This?

Even if you’re already running monitoring tools, using sp_whoisactive to sample data from the DMVs (including query plans) can be invaluable.

When things go wrong, terribly wrong, you want to have this script in your back pocket to collect data about activity on your SQL instance in the background while you’re focusing on the problem. This will help you resolve the issue as quickly as possible, by referring periodically to the data it’s collecting. Since the data is stored, you can also use it afterward to explain the situation and prevent it from happening again.

Step 1: Create Your @destination_table

Let me just note again: this is not my original work. I don’t want to take credit for it, but I do want to advocate for it!

First, create your destination table. Sp_whoIsActive lets you generate a destination table easily. Here is a basic version which includes the current date in the table name, as well as three flags to get transaction info (tlog write info and duration), query plans, and block leaders:

DECLARE @destination_table VARCHAR(4000) ;
SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;

DECLARE @schema VARCHAR(4000) ;
EXEC sp_WhoIsActive
@get_transaction_info = 1,
@get_plans = 1,
@find_block_leaders = 1,
@RETURN_SCHEMA = 1,
@SCHEMA = @schema OUTPUT ;

SET @schema = REPLACE(@schema, '<table_name>', @destination_table) ;

PRINT @schema
EXEC(@schema) ;

Step 2: Create Your Loop to Periodically Log Data

You need to need to make sure to specify the same parameters for sp_whoisactive in this loop, so the output matches the schema for the table you’ve created.

This does 10 runs with a 5 second wait between runs. Depending on what’s going on, I change those numbers accordingly.

DECLARE
    @destination_table VARCHAR(4000) ,
    @msg NVARCHAR(1000) ;

SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;

DECLARE @numberOfRuns INT ;
SET @numberOfRuns = 10 ;

WHILE @numberOfRuns > 0
    BEGIN;
        EXEC dbo.sp_WhoIsActive @get_transaction_info = 1, @get_plans = 1,
            @find_block_leaders = 1, @DESTINATION_TABLE = @destination_table ;

        SET @numberOfRuns = @numberOfRuns - 1 ;

        IF @numberOfRuns > 0
            BEGIN
                SET @msg = CONVERT(CHAR(19), GETDATE(), 121) + ': ' +
                 'Logged info. Waiting...'
                RAISERROR(@msg,0,0) WITH nowait ;

                WAITFOR DELAY '00:00:05'
            END
        ELSE
            BEGIN
                SET @msg = CONVERT(CHAR(19), GETDATE(), 121) + ': ' + 'Done.'
                RAISERROR(@msg,0,0) WITH nowait ;
            END

    END ;
GO

Step 3: Set Up Your Query to Look at the Results

This is optional, but since the table is named by datestamp, I like to keep a query on hand to look at the results without having to copy and paste the table name. I also like to put the collection_time column first and sort by it.

DECLARE @destination_table NVARCHAR(2000), @dSQL NVARCHAR(4000) ;
SET @destination_table = 'WhoIsActive_' + CONVERT(VARCHAR, GETDATE(), 112) ;
SET @dSQL = N'SELECT collection_time, * FROM dbo.' +
 QUOTENAME(@destination_table) + N' order by 1 desc' ;
print @dSQL
EXEC sp_executesql @dSQL
Comments { 16 }

Internals Matter: Why You Should Check Your Page Verify Settings, and I Should Go to Masters Immersion Training (SQL Skills Contest Entry)

This post is about two things:

1) Your Page Verification Settings are Critical

You should confirm you are running with the correct page verification settings on your SQL Server databases. It’s very important, and I’ll show you why.

2) I Should Attend the Awesome SQLSkills Master Immersion Event – Internals and Performance

My scripts to demo the importance of page verification settings are part of  my entry to win free registration to the SQLSkills Master Immersion Event on Internals and Performance in February. Read more about the training here.

Keep reading to see why I hope this helps me win.

Let’s Start with Page Verification

The default setting for the PAGE_VERIFY database option for SQL 2000 was TORN_PAGE_DETECTION. As Books Online explains here:

When TORN_PAGE_DETECTION is specified, a specific 2-bit pattern for each 512-byte sector in the 8-kilobyte (KB) database page is saved and stored in the database page header when the page is written to disk. When the page is read from disk, the torn bits stored in the page header are compared to the actual page sector information.

SQL 2005 introduced a new page verify database option, CHECKSUM. This was the default for new databases, but the page verification option was not automatically changed when databases were migrated or upgraded from SQL 2000 to SQL 2005. Books Online explains that when you run with CHECKSUM…

the Database Engine calculates a checksum over the contents of the whole page and stores the value in the page header when a page is written to disk. When the page is read from disk, the checksum is recomputed and compared to the checksum value stored in the page header.

It is quite common to find databases on SQL 2005 instances and higher still using the TORN_PAGE_DETECTION setting for page verification. When a change to the setting is suggested, people naturally ask if the extra protection is worth the small amount of extra overhead of computing the checksum.

It certainly is! This post demonstrates why.

To Play Along At Home…

You may want to read my prior post on causing database corruption with a hex editor here. This post explains how to edit data in unsupported ways, and these techniques should never be used near a production server, or a database of any value at all. Make sure to wash your hands before and after using these practices, and having a change of clothes handy is also advisable.

You’ll also need to download the XVI32 hex editor by Christian Maas. No installer is necessary: download the zip file, then unzip all files to a directory and run XVI32.exe

Our Victim Database

Let’s say we have a database with entry data for a contest. The contest is, quite coincidentally, the SQLSkills Master Immersion Event on Internals and Performance in February!

Let’s create our database and schema for our contest entries.

USE master ;
IF DB_ID('CorruptMe') IS NOT NULL
    BEGIN
        ALTER DATABASE CorruptMe SET SINGLE_USER WITH ROLLBACK IMMEDIATE
        DROP DATABASE CorruptMe
    END
CREATE DATABASE CorruptMe ;
GO
ALTER DATABASE CorruptMe SET PAGE_VERIFY TORN_PAGE_DETECTION ;
GO
USE CorruptMe ;
--This schema might have an issue or two ;)  Just play along.
CREATE TABLE dbo.Contest (
    ContestId INT IDENTITY,
    ContestName NVARCHAR(256),
    CONSTRAINT cxContestId_Contest UNIQUE CLUSTERED (ContestId)
 )
CREATE TABLE dbo.ContestEntry (
    ContestEntryId INT IDENTITY,
    ContestId INT REFERENCES dbo.Contest ( ContestId ) ,
    FirstName NVARCHAR(256) ,
    LastName NVARCHAR(256) ,
    ContestStatus NVARCHAR(256),
    CONSTRAINT cxContestEntryId_ContestEntryId UNIQUE CLUSTERED (ContestEntryId)
)

INSERT  dbo.Contest ( ContestName)
VALUES ( '5-Day Internals and Performance class in Dallas, February 21-25')

INSERT  dbo.ContestEntry ( ContestId, FirstName, LastName, ContestStatus )
VALUES ( 1, 'Hello', 'Kitty', 'Win!' ),
	( 1, 'PePe', 'LePeu', 'Loss' ),
	( 1, 'Kendra', 'Little', 'Sigh' )

It looks like the contest has already been judged. How did everyone do?

SELECT c.ContestName, ce.FirstName, ce.LastName, ce.ContestStatus
FROM dbo.Contest c
JOIN dbo.ContestEntry ce ON c.ContestId=ce.ContestId

Well, it figures. Hello Kitty can be pretty persuasive. Congrats, Kitty.

But What If…

What if something happened to the data? The database is running with TORN_PAGE_DETECTION– let’s see what happens if we change some of the values  in the physical data (mdf) file with a hex editor.

First, take a look at the pages for dbo.ContestEntry:

DBCC IND ('CorruptMe', 'ContestEntry', 1)

On my instance, we have one data page in our clustered index (PageType1). It’s PagePID 147. Let’s see if we can make some changes on that page and if SQL Server will notice.

To do this, take the database offline, and figure out your physical file name and your offset:

USE master ;
ALTER DATABASE CorruptMe SET OFFLINE ;

SELECT physical_name FROM sys.master_files WHERE name = 'CorruptMe' ;
SELECT 147 * 8192 AS [My Offset]

Then run the XV134 Hex editor with administrative privileges, and open the file.
Use Address -> GoTo and enter your offset as calculated above. (In my case: 1204224).

Here’s the data as I found it on the page (click to view in a new window):

And here is the data after I made quite a few changes in the hex editor, just by typing in the right pane (click to view in a new window):

That couldn’t possibly work, could  it?

Let’s put the database back online and see.

ALTER DATABASE CorruptMe SET ONLINE
GO
USE CorruptMe
GO
SELECT c.ContestName, ce.FirstName, ce.LastName, ce.ContestStatus
FROM dbo.Contest c
JOIN dbo.ContestEntry ce ON c.ContestId=ce.ContestId

This returns the following. The values on the page aren’t just in a different order, they’re different than the original values:

What About When I Run the Next CheckDb?

Let’s see:

DBCC CHECKDB('CorruptMe')

This comes back clean. An excerpt from the output is:

What Would Happen if We Were Using Page Verify CHECKSUM?

You can see this by running through the demo again. This time, use CHECKSUM as your database PAGE_VERIFY option, either by default or by running this after it is created:

ALTER DATABASE CorruptMe SET PAGE_VERIFY CHECKSUM;

After editing the page as described in the hex editor, you will be able to bring the database back online successfully. However, as soon as you read the records from the dbo.ContestEntry table, you’ll get an 824 error for the page we edited, like this:

Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x54bc84f5; actual: 0×41298172). It occurred during a read of page (1:147) in database ID …

CheckDb will also alert you to the error the first time it is run.

The Moral of the Story Is: CHECKSUM is the Way to Go For Page Verify

This isn’t about security– this is about you knowing if changes to SQL Server’s files are happening outside of its control, whether by corruption or any other means.

It’s very easy to check if you’re using CHECKSUM. Take a look at your settings today, and make a plan to move to checksum for any databases which need it!

SELECT NAME
FROM sys.databases
WHERE page_verify_option_desc <> 'CHECKSUM'

Ideally you will include an index rebuild in your upgrade plan, or shortly thereafter– because the checksum needs to be written to each page, this doesn’t start working for a page  until it’s had a write.

For all sorts of more information, see Paul Randal’s post on Myths around Checksums.

Meanwhile, I’m Entering A Contest

Back to our contest for the SQLSkills Master Immersion Event on Internals and Performance in February (info).

My challenge was to express why I’d like to the go to the class, and why I’ll make the best use of the knowledge I gain.

I’d love to go because:

  • I’m passionate about learning about internals and performance. Hopefully I’ve demonstrated that here!
  • I still have a lot to learn, and there’s no replacement for a great classroom experience.
  • I know SQLSkills training is highly interactive and is a great place to ask questions and get real-world, in-depth answers.
  • This training will help me grow as a professional.

I’ll make the best use of the knowledge because:

  • I love using what I learn to explore more and write blog posts like this– it helps me learn more and share with the community.
  • I’m now the *only DBA* at a small, clever internet business. Every bit of expertise I can gain can help me work as part of a great team and make a real difference in our business.
  • I’m good at asking questions and helping others learn– I work hard to be a great fellow student and learn together with others.
  • I’ve clearly learned enough to be dangerous ;)

What I’d Like to Do at the Training

I would love to blog my way through my first immersion event. I certainly couldn’t (and wouldn’t) capture all the content we cover, but I’d like to capture what the experience is like. I’d also like to characterize how it  helps me think about, and work  with, SQL Server differently.

I hope whomever is selected as the winner chooses to share their experience.

Comments { 5 }

Read from the Right End of the Index: BACKWARD Scans

Optimizing queries is the most fun when you don’t need to add indexes. There’s nothing quite so nice as finding a way to make reading data faster, without slowing down writes or creating new data structures that need to be maintained.

Here’s one way you can use BACKWARD scans to do this.

The Scenario: Clustered index on an increasing integer, and you’d like recently created rows

This is a common enough situation: you have a table with a clustered index on an integer value which increases with each row. You have another column which records the date the row was created.

You’d like frequently query the most recently created rows over some period of time.

The table has very frequent inserts, so for performance reasons you want to use the minimal indexes required. (And in general, this is the best practice.)

Question: Do you need to add a nonclustered index on the column containing the date the row was created?

Answer: Maybe not!

Getting the right clustered index scan

Say we’re working with the following table, which we have filled with five million rows of Tweetie birds. (Note: This generation technique is a tally table population technique which I found on Stack Overflow, which is attributed to Itzik Ben-Gan.)

CREATE TABLE dbo.Birds (
    birdId INT NOT NULL ,
    birdName NVARCHAR(256) NOT NULL,
    rowCreatedDate DATETIME2(0) NOT NULL )
GO	

--Insert 5 million Tweetie birds
--Make them as if they were all created a minute apart.
;WITH
  Pass0 as (select 1 as C union all select 1),
  Pass1 as (select 1 as C from Pass0 as A, Pass0 as B),
  Pass2 as (select 1 as C from Pass1 as A, Pass1 as B),
  Pass3 as (select 1 as C from Pass2 as A, Pass2 as B),
  Pass4 as (select 1 as C from Pass3 as A, Pass3 as B),
  Pass5 as (select 1 as C from Pass4 as A, Pass4 as B),
  Tally as (select row_number() over(order by C) as Number from Pass5)
INSERT dbo.Birds (birdId, birdName, rowCreatedDate)
SELECT Number AS birdId ,
    'Tweetie' AS birdName ,
    DATEADD(mi, number, '2000-01-01')
FROM Tally
WHERE Number <= 5000000

--Cluster on BirdId. We won't add any other indexes.
CREATE UNIQUE CLUSTERED INDEX cxBirdsBirdId ON dbo.Birds(BirdId)

Say we would just like to see the maximum value in the rowCreatedDate column.

The most basic way to get this row is with this query:

SELECT MAX(rowCreatedDate)
FROM dbo.Birds

However, that leads to a table scan. We get lots of reads: 22,975 logical reads and 201 physical reads.

If we know we have a strong association between the BirdId column and the RowCreatedDate column, and that the highest ID in the table is the most recent row, we can rewrite the query like this:

SELECT MAX(rowCreatedDate)
FROM dbo.Birds
WHERE birdId = (SELECT MAX(birdId) FROM dbo.Birds)

This query still does a clustered index scan. But yet it does only 3 logical reads and 2 physical reads.

Looking in the execution plan, our query was able to use the extra information we provided it to scan the index backwards. It stopped when it had everything it needed, which was after a short distance– after all, it only needed recent rows, and those are all at one end of the table.

This backwards scan can be very useful, and can make using the MAX aggregate very useful.

But you usually need more than just the max value…

To see a bit more about how you extend this logic, compare these three queries:

Query A

This makes you think you need that non-clustered index: it does 22,975 logical reads, 305 physical reads, and 22968 read-ahead reads.

--Only run against a test server, not good for production
DBCC DROPCLEANBUFFERS

SELECT birdId, birdName, rowCreatedDate
FROM dbo.Birds
WHERE rowCreatedDate >= '2009-07-01 05:00:00'

Query B

We can introduce the backwards scan by adding an ORDER BY BIrdId DESC to the query. Now we get 23019 logical reads, 47 physical reads, and 22960 read-ahead reads.

--Only run against a test server, not good for production
DBCC DROPCLEANBUFFERS

SELECT birdId, birdName, rowCreatedDate
FROM dbo.Birds
WHERE rowCreatedDate >= '2009-07-01 05:00:00'
ORDER BY birdid desc

Query C

The this last query gives the optimizer extra information about using BirdId to do a BACKWARD scan to grab the maximum BirdId, and then use that to do a BACKWARD seek of the clustered index in nested loops to get the data. It does only 50 logical reads, 4 physical reads, and 817 read-ahead reads.

--Only run against a test server, not good for production
DBCC DROPCLEANBUFFERS

SELECT birdId, birdName, rowCreatedDate
FROM dbo.Birds
WHERE birdId >=
	(SELECT MAX(birdId)
	FROM dbo.Birds
	WHERE rowCreatedDate <= '2009-07-01 05:00:00')
AND rowCreatedDate >= '2009-07-01 05:00:00'
ORDER BY birdId DESC

Be Careful Out There

The examples I’m using work because there is a correlation between the integer field and the date field. Not all tables may be like this. As with all queries, you need to be familiar with your data.

Consider Your Options– Even the Ones You Don’t Think Are Great

I’m quite sure BACKWARD index reads are covered in some talks and publications on tuning.  But I learned about this by considering multiple approaches, even those I didn’t think would work at first. It pays to try things out, and you can look a lot by looking carefully at execution plans (including the properties) and your Statistics IO output.

What this means to me: it’s good to keep an open mind.

Comments { 4 }