It can be pretty difficult to manage statistics in data warehouses, or even OLTP databases that have very large tables.
This is because, even with auto_update_statistics turned on, SQL is pretty conservative about when to update statistics due to the cost of the operation. For large tables, statistics are updated when “500 + 20% of the number of rows in the table when the statistics were gathered” have changed. (see BOL here)
So for a table with 50 million rows, statistics will auto update when more than 10 million 500 rows have changed. I have a lot of tables with a lot of rows, and this can be a problem. Take a fact table, for instance, where the key is sorted by an integer representing a date. Every day, a large amount of new records are loaded and there is suddenly a lot of records for a new day. Typically though, the number of rows changed in one day in these large fact tables is not enough to trigger automatic updating of statistics.
You don’t want to update statistics more often than you have to, however. A schema stability lock is put in place, and depending on what you’re doing it can cost a bit in IO. You want to learn how often you need to manage your statistics in a given set of tables, and also have a guideline to use to ensure you’re staying within where you need to be over time.
Comments