This week’s question is about a longstanding feature in SQL Server that sounds really cool: full-text search. If you’re learning performance tuning, how much time should you invest in researching and learning about full-text indexes?
Watch this 18 minute video, or scroll on down to read the written scoop on full-text search.
Dear SQL DBA…
I have been doing performance tuning for about 9 months now. It puzzles me that one type of index never gets much attention: full text indexes. Are fulltext indexes a cool feature that can really help performance (all that LIKE ‘%blabla%’ predicates application developers seem to love :-) ) or are they quite the opposite and not worth investing time in ? Best regards, Puzzled about fulltext
The “dirty little secret” about full-text search indexes is that they don’t help with ‘%blabla%’ predicates.
Well, it’s not a secret, it’s right there in the documentation.
A lot of us get the impression that full-text search is designed to handle “full wildcard” searches, probably just because of the name. “Full-Text Searches” sounds like it means “All The Searches”. But that’s not actually what it means.
What is full-text search good for?
Full-text indexes can help with:
- Prefix searches. It’s good for ‘bla%’
- Phrases containing words. So it’s good for ‘So blabla to you’
- Different forms of a word / synonyms (is there a synonym for blabla? I don’t know!)
- Words near one another in a document (‘bla’ is in a document in proximity to ‘blop’)
Full-text search also has special features like stoplists and stopwords to keep the index from becoming more bloated than it has to be, and help searches be more efficient.
One way to think about this is that full-text search is designed to be smart about language: it thinks about phrases, synonyms, how words are used, things like that.
A pure wildcard search of ‘%blabla%’ isn’t really about language. That’s just looking for a pattern somewhere in a string.
For wildcard searches and regular expression queries, secondary applications like Lucene are attractive, and these days in the cloud there are options like Lucene Query in Azure Search.
Aside: Azure Search is easy to play with for free
A while back I wrote a post called Wildcard vs Regular Expressions – Lucene Query in Azure Search.
It shows how easy it is to play around with texting non-sargable wildcard searches like ‘%blabla%’ against online sample data in Azure. All you need is a browser, it’s totally free and you don’t even have to create an Azure account.
Fulltext indexes and performance
I’ve run into quite a few companies using full-text search. Most of them were using it pretty lightly, and it rarely was something they asked me for help with: they set it up following the documentation, and it just worked. There were quite a few cases where I’d say something about seeing a full-text index when looking over an instance, and my client laughed and said they’d forgotten they even used full-text. (If you think about it, that’s a compliment to the feature.)
I’ve also run into some folks who’ve used full-text search so heavily that they pushed the boundaries of the feature: very large multi-terabyte databases pulling in large volumes of data.
Keeping data in sync with heavy update rates
With heavy to ultra-heavy usage, one issue with full-text indexes is that they don’t update synchronously with the base table. This is helpful for performance for inserts, updates, deletes into the base table, because updating a large full-text index can take time. But it does mean that if your application allows both queries of the base table AND the full-text index, people could see different, contradictory data if the full-text index is behind.
What if corruption strikes?
And as with any other index, you can get corruption in a full-text index. That’s not necessarily the SQL Server’s fault: corruption can come from the storage subsystem. If your full-text index gets corrupt, you’re probably going to have to rebuild it.
If you’re working with giant full-text indexes, recreating the index can add up to a lot of downtime. Thinking about how your tables are laid out and breaking your indexes into manageable chunks becomes very important at scale.
I think full-text search is here to stay, it’s just getting interesting company
This is an older feature, so there’s always that question as to how “fresh” it is.
Microsoft has invested in making full-text indexes perform better over the years. The feature was revamped in 2008 and has received a variety of performance fixes over the years. A new DMV, sys.dm_fts_index_keywords_position_by_document , was added in SQL Server 2016 and also backported to previous versions.
Full-text search is well maintained by Microsoft. I don’t think it’s going anywhere.
What about semantic search?
In SQL Server 2012, Microsoft added the semantic search feature built on top of full-text search. Semantic search helps identify the main phrases in a document and can find and compare similar/related documents.
Semantic is one of those features that dropped in and then seemed to disappear from the conversation, though.
I haven’t heard of its capabilities being strongly expanded in later versions, and I know people who evaluated it in SQL Server 2012 who found it to be too much of a “v1 feature” to fit their needs, compared to features offered by third-party vendors with semantic search tools. (Of course, they were evaluating native semantic search because not everything was perfect with their third party app, either.)
Here is one such investigation into semantic search by Joe Sack - Exploring Semantic Search Key Term Relevance.
If you use semantic search in production and know about improvements that I’m unaware of, I’d love to hear about it in the comments!
How much time should you invest in learning full-text indexes?
To sum up, full-text indexing is fairly widely used, but most of the folks using it are doing so on a small scale where it “just works.” Those companies are unlikely to have a high bar on full-text index skills when it comes to hiring, and they may not even ask you questions about it at all in a job interview.
For most folks, I think it’s worth knowing the basic limitations of full-text and what the feature does.
A one-time investment of an hour to read and make notes for yourself is generally enough to get you to a point where you can identify potential use cases. If you ever find those use cases, at that point you can invest more time in evaluating how well full-text fits that implementation.
After getting the big picture from this post, reading the Books Online page on full-text search is probably good enough for most people. That’s where I’d spend the rest of your hour.
After that, I wouldn’t invest a bunch of time learning about full-text indexes unless you’ve got a specific reason. You’re better off investing your time learning about wait statistics, tuning TSQL using execution plans, rowstore indexes, columnstore indexes, Query Store, and In-Memory indexes.
Some fun related topics: building your own type of full-text index, and querying with regular expressions
Aaron Bertrand writes about building your own word-part index: One way to get an index seek for a leading %wildcard
Dev Nambi created an open-source project, sql-server-regex, that uses the SQLCLR “lets you run regular expressions in T-SQL queries using scalar and table-valued functions.” I know for a fact that Dev is crazy good at this stuff, because I worked with him for several years out there in the real world. He’s a unicorn.