Wildcard vs Regular Expressions - Lucene Query in Azure Search

on October 25, 2016

Sometimes clever developers let you try out cloud services with very little work on your part. This is one of those times - get ready for “Playing with Azure Search: No Installation Required.”

I recently saw a question about Azure Search on Twitter:

Guys, is Azure Search capable to do a substring search, i.e. like ‘%bian%’ and find a record with a ‘Fabiano’ string in it? #sqlhelp

That’s a great question

I was immediately curious about the answer – and I immediately wanted the answer to be ‘yes’.

Someone else quickly responded to the tweet and explained that Azure Search does support wildcard searches, but it’s only for prefix searches. So if you’re looking for jobs that contain the word ‘Architect’, your query could say: show me all job titles where there’s a word leading in Arc*.

That’s true, but wildcard searches aren’t the only way to look for strings.

foo-finders

I wanted to be able to find all architect jobs using something like ‘%rchit%’ as well, because there’s not a lot of great ways to do this in SQL Server.

In SQL Server, you can use a traditional B-Tree index to seek, but only based on the letters at the beginning of a character column.  If I want to know every business title that contains ‘%rchit%’, I’m going to have to scan an entire index.

SQL Server fulltext indexes don’t solve the double-wildcard problem, either. Fulltext indexes support word prefix searches– so a fulltext index would be great at finding all job titles that contain a word that starts with ‘Arch%’.

Sometimes that’s enough. But a lot of times, you do need to find a substring anywhere in a word. And sometimes you do want to offload that from your database.

Azure Search supports regular expressions

I did some quick searching and found a very helpful article from Microsoft’s Liam Cavanagh, Lucene query syntax examples for building queries in Azure Search.

There’s all sorts of cool stuff in the article. The examples are a hoot, because you can click on them to play with them in a browser.

You can also modify the examples and see what you can get to work!

Example 8 in the article is for regular expressions. It includes an example query to return all job titles that contain Senior or Junior by plugging in the regular expression, business_title:/(Sen|Jun)ior/).

I had one little issue because of the extra end parentheses that I highlighted in red. As soon as I tweaked the example to the following, it worked great in my browser.

/* Code example from Lucene query examples, last right paren removed */

http://fiddle.jshell.net/liamca/gkvfLe6s/1/?index=nycjobs&apikey=252044BE3886FE4A8E3BAA4F595114BB&query=api-version=2015-02-28-Preview%26queryType=full%26$select=business_title%26search=business_title:/(Sen|Jun)ior/

Here’s what the results look like in Chrome:

But can we go full wildcard?

Now that I got the example working, can I see all the job titles that contain %rchit% using the regular expression search?

I don’t use regular expressions a lot, but when I do I just cheat and look up syntax on the internet.  There may well be a better way to do this, I looked for the first thing that world.

Behold:

http://fiddle.jshell.net/liamca/gkvfLe6s/1/?index=nycjobs&apikey=252044BE3886FE4A8E3BAA4F595114BB&query=api-version=2015-02-28-Preview%26queryType=full%26$select=business_title%26search=business_title:/(.*)rchit(.*)/

The results:

I’m no Architect, but that looks like it worked to me. Awesome!

I’m gonna guess that this can be slow, too

There’s probably a good reason that wildcard searches make it simple to look for words by prefix, and you have to dig around in regex to get a substring: performance. I’m just guessing here, but we’re probably doing something similar to scanning an index in Azure Search (if that’s not exactly what we’re doing).

But having a supported way to do this in Azure Search is still awesome– it has different scale and licensing costs than doing the scanning in SQL Server. It’s all about having options.

This is fun - play with it

Even if you’re not going into the cloud tomorrow, this is really easy and fun to play around with. Grab one of those URLs and enjoy searching.

If you’re already an Azure Search user, I’d love to hear about your experience in the comments!