Dear SQL DBA, What do you say to a SAN admin when you think that the billion dollar SAN *may* be the bottleneck and you just want to look into it. What are the technical things I need to say to make them believe there might be something to my questions?
Note: This is a “listen-able” 36 minute video. You can also listen to this as a podcast - learn how at littlekendra.com/dearsqldba.__
This question is near and dear to my heart
I’ve been called in a lot as a consultant with the question, “Is the storage slowing us down? If so, what do we do?” This is really mystifying to a lot of people just because they don’t know where to look, and SQL Server gives out so many metrics.
The good news is that this doesn’t require a whole lot of rocket science. You just need to know where to look.
Your Mission: Show evidence of when disk latency impacts performance, and how much it hurts
Do your homework collecting data from the SQL Server and (sometimes, maybe) Windows.
If it’s not “emergency slow”, look for potential workarounds, such as adding memory to reduce the amount of reads – that’s always cheaper than speeding up storage. Take care of due diligence looking at index fixes, too.
When those won’t do it, perform an analysis of exactly where faster storage would help you the most, document what it will help, and ask for help speeding up the workload.
Make all of your notes and data available to the SAN admin, but write up a short TLDR summary.
I like that you’re concerns about this already! You know this is a sensitive topic, and that things can go wrong when you bring this up.
Talk about “disk latency” that you’re seeing.
Some people say things like, “The SAN is slow.” That’s like the SAN admin saying, “The SQL Server is dumb.” Things tend to go badly after you call someone’s baby ugly (whether or not the baby is actually ugly).
In reality, the problem could be part of the N of SAN– the network. Saying “the SAN is slow” is a really general thing, and that’s part of why it’s not really helpful. I’ve had slow storage incidents that were solved by replacing a single cable.
“Emergency Slow” - Look in the SQL Server Error Log for “15 second” warnings
Filter the SQL Server log looking for messages with “longer than 15 seconds” in them.
The full message is like this:
SQL Server has encountered [#] occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [blah blah blah] in database [blah].
This message means SQL Server sent off an I/O request and waited 15 seconds (1…. 2…. 3…. ) before getting anything back.
The # of occurrences is important. Was the SQL Server even doing a lot?
This message will only occur once every 5 minutes, so if you see the messages repeat, the latency may have been continuous.
If you see these messages, this is severe latency. Stop and send the times to the storage admins right away. Note that the SQL Server had severe storage latency at these times, and ask if any scheduled maintenance was occurring.
If there was no maintenance, ask for help starting a ticket up to investigate what’s causing this. This is a severe latency problem. For the storage with the problem, start by asking:
- How many servers / computers share that storage
- How active was it at that time on the SAN side
Gather read/write latency metrics from SQL Server’s “virtual file stats” dynamic management view
Sys.dm_io_virtual_file_stats is your friend here!
You want to know about latency when SQL Server reads from storage. This isn’t all reads, because when it can read from memory alone, it avoids the trip to storage.
Look at how much is read and written to in samples when performance is poor.
- Volume of reads and writes by file (I look at MB or GB because those units make sense to me)
- Latency - how much are you waiting?
Lots of reads but very low latency is evidence that you’re using the storage, but things are going swimmingly. And that may be the case– be open to that! The beauty of this DMV is that it can tell you if the problem is elsewhere. (If so, start here.)
This DMV reports on all physical reads, including IO done by:
- DBCC CHECKDB
- Index maintenance
That means that the data since start up is diluted– it contains maintenance windows as well as periods of time when just not much was going on. It’s a data point, but what you really want to know is what does the data look like in 5 minute periods when performance was poor and when maintenance wasn’t running (unless your problem is slow maintenance).
There are a bunch of free scripts to sample this, if you don’t feel like writing your own. Here are two:
Analysis: Which files have latency, and is it acceptable?
Typically throw out samples where very few reads and writes are done. A tiny amount of reads being slow is usually not an issue.
What’s impacted the most?
- Reads from user database data files?
- Writes to user database log files?
- Reads and writes in tempdb?
- Something else?
The latency numbers tell you how severe the latency is impacting you
Some latency is acceptable. From Microsoft’s “Analyzing I/O Characteristics and Sizing Storage Systems for SQL Server Database Applications”
- Data file latency: <= 20 ms (OLTP) <=30 ms (DW)
- Log files: <= 5 ms
These are pretty aggressive targets.
Your storage is not “on fire” if you see periodic 100 ms write latency to your data files.
One point on writes to data files: SQL Server writes to memory and a write ahead transaction log. Seeing a 100ms insert on writes on a data file does not mean a user is waiting 100ms when inserting one row.
However, seeing a 100ms read latency from a data file may mean that a user is waiting on that.
Things to think about:
- Adding memory frequently reduces read latency from storage for data files
- Data file reads don’t have to be blazing fast when there’s sufficient memory to make reads from storage fairly rare and the workload is largely reads
- In many environments, 100 ms is tolerable as read latency. It’s all about performance standards.
Heavy write latency to transaction logs can usually only be sped up either by speeding up storage or (sometimes) changing application patterns if you’ve got a lot of tiny small commits.
- Both of these options are typically somewhat expensive, unfortunately
Helpful: Cover your (developer’s) butt with a sanity check on indexes
Arguably, storage shouldn’t be slowing you down, whether or not you have good indexes
However, it’s polite to do a sanity check, because indexing can dramatically reduce the number of physical reads you do (sometimes)
Note: you shouldn’t have those “15 second” latency warnings in the SQL Server error log, no matter what your indexes are.
If you’ve never done an indexing health check and can request for a developer to do a quick assessment, that’s fine too– you don’t have to do it yourself.
Sometimes helpful politically: Gather disk latency counters from Windows
DBAs are usually suspicious of the SAN, and SAN administrators are usually suspicious of the SQL Server.
If this dynamic is a problem in your environment, Windows is kind of a neutral middle ground that helps make the SQL Server info more convincing / less threatening. It will “back up” the virtual file stats data if you need that.
Latency counters on the PhysicalDisk Object:
- Physical Disk - Avg Disk sec/Read
- Physical Disk - Avg Disk sec/Write
- Talk about measured storage latency and its impact on the SQL Server. Sticking to the data points helps keep people from getting defensive / taking it personally.
- Look for “emergency slow” first, and triage those as a critical issue
- Use samples from sys.dm_io_virtual_file_stats to identify if storage latency impacts you
- Analyze it to identify which databases are impacted, and reads vs writes
- Consider whether adding memory is a cheaper way to reduce the latency
- Remember that backups and maintenance do physical IO and will show in virtual file stats
- Do an indexing sanity check if possible as part of due diligence
- Collecting windows physical disk latency counters may help “bridge the gap” for some SAN admins
What if my SAN admin doesn’t know what to do?
- Your storage vendor would love to help (for a cost) if your SAN admins aren’t sure how to troubleshoot the latency
- If the SAN is a recent implementation and what you’re seeing doesn’t live up to what was advertised, you may be able to get some of their time based on that difference (but your mileage will vary)