Last week I was working on a side project of mine were we needed to do a text search through a bunch of titles. The quick and dirty way to solve this was to use a %LIKE% in the database WHERE clause. This is ofcourse a very crude way of searching, but for the initial implementation it was good enough. But then my mind started to wander, how can we improve this? My experience with ElasticSearch gave me an idea to use n-grams for this.
So we have a collection of titles ( in this case product names ):
Now we have a searchterm say
potato. Searching through this collection with a like statement will return the following set:
This result is pretty decent, the only problem here is that the ordering is essentially random. It depends on the order the data is stored in the database.
Now let’s say the user types in pototo, using the %LIKE% query this will not return any matches. What we want is that even when the user makes a typo relevant results are returned. Also we want to know what the best search result is. One way to do this is to use n-grams.
So what are n-grams, luckily we have wikipedia which provides a good starting point.
In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items can be phonemes, syllables, letters, words or base pairs according to the application. The n-grams typically are collected from a text or speech corpus. When the items are words, n-grams may also be called shingles.
N can be any number, that means we have a 1-gram, 2-gram, 3-gram etcetera. A 2-gram for the word potato we used before will look like this.
po - ot - ta - at - to
A 3-gram like this:
pot - ota - tat - ato
For this example we will have 5 terms to search through: aardbei, aardappel, kandei, haard and, baard (these words are dutch :) ). The first step will be to split the terms into n-grams. To keep it simple we will only be looking at 2-grams. In a real world scenario you probably want to use an combination of different n-grams. Also every term will be given an id to make it easier to reference them.
|1||aardbei||aa, ar, rd, db, be, ei|
|2||aardappel||aa, ar, rd, da, ap, pe, el|
|3||kandei||ka, an, nd, de, ei|
|4||haard||ha ,aa, ar, rd|
|5||baard||ba, aa, ar, rd|
So let’s say we want to search for the word baard. The first step is to split the search term into n-grams. This will lead to the following the n-grams: ba, aa, ar, rd. For every n-gram we can look into the above table to find out which id it matches. Doing that for every n-gram will lead to the following table.
To find out what the best match we can count the times an id is in the table. In this case there are only a limited number of entries so it easy to see that id number 5 has 4 entries en the other numbers only 4. This means
baard is the best search result. This table can also be used to rank the different search results.
Using n-grams it easy to setup an text search. In the next part we will implement an custom search index, so we can use in a web application.