Autocomplete
In the past decade, autocomplete has become a required feature for search engines. Today, searchers who type into a search box expect to see autocomplete suggestions; otherwise, they’ll conclude that search is broken.
Autocomplete isn’t just a way to help searchers type less. Autocomplete helps searchers express their intent. When autocomplete works as intended, query understanding guides the process of typing a search query.
Query Probability
Autocomplete predicts complete search queries from partial ones. We can model this process using conditional probability: given a partial query, we want to offer searchers the most probable query that completes it.
We can use historical query logs to compute the probability of a query based on how frequently it appears in the log. For example, if our log contains a 1,000,000 queries and 1,000 of them are for pants, then the probability of pants is 1,000/1,000,000 = 0.001.
That’s the a priori probability. As the searcher types in more characters, the conditional probability of a query either increases (because the denominator decreases) or goes to zero because the query doesn’t start with the entered prefix. Continuing with our example, if 50,000 of the 1,000,000 queries start with pa, then Pr(pants|pa)=1,000/50,000=0.02. But Pr(pants|pat)=0. Or does it? We’ll get back to that in a moment.
Not as Simple as it Looks
Computing probabilities this way is simple, but reality is a bit more complex. Here’s a taste of that complexity:
- Query frequency in the logs doesn’t take into account recency, seasonality, or other time-dependent factors. There are algorithmic and editorial strategies to emphasize recency or other time-dependent factors, but they all have trade-offs. There’s no one right approach.
- There are also non-temporal factors: we may want to build a regression model using features like location, gender, session context, etc. These approaches generally lead to more accurate probability estimates, but they also increase the risk of overfitting.
- The partial query may not be a prefix, for example, we shouldn’t assume Pr(mens pants|pa)=0 just because pa isn’t a prefix of mens pants. We should strongly prefer prefixes, but we should allow for exceptions.
- The partial query may be misspelled, e.g. something typing in pat may intend to type pants. In that case, it’s necessary to combine autocomplete with spelling correction — something we’ll cover as an advanced topic in a later post.
- You may want to exclude autocomplete suggestions from the logs when computing query probabilities, in order to avoid a positive feedback loop.
Query Performance
Query popularity tells us how often people express their intent through a particular search query. But it doesn’t tell us how often we manage to understand that intent and deliver a successful search experience.
We can use clicks or actions (e.g., purchases in the context of a shopping site) as indicators of search success. We’ll save a deeper discussion of query performance for another post, but for now let’s assume that clicks indicate successful searches.
Then, instead of counting all of the times a query appears in the log, we can count only the queries that lead to clicks. In other words, we choose the autocomplete suggestions that maximize the conditional probability of a click, given the entered prefix.
A Trade-Off
We can think of the conditional probability of a click given a prefix as the product of two factors:
- The conditional probability of the query, given the prefix.
- The conditional probability of a click, given the query.
Continuing our earlier example, if the query pants has a click-through rate (CTR) of 0.1, then the probability of a click for pants given the prefix pa is 0.02 * 0.1 = 0.002.
Let’s add another query to our example: the query pant cuffs with only 100 occurrences in the log (a tenth of pants), but a CTR of 0.5. The probability of a click for pant cuffs given the prefix pa is 0.002 * 0.5 = 0.001.
The probability of a click for pants given the prefix pa is double the probability of a click for pant cuffs. Yet there’s something disconcerting about favoring a query with a CTR of 0.1 over a query with a CTR of 0.5, just because the lower-performing query is much more popular.
The risk of favoring low-performing popular queries is that we’ll lead the searcher to an unsuccessful experience. If we instead favor queries that perform better but are less popular, we may require searchers to type more, but we’re less likely to lead them astray.
There’s no one answer on how to manage the trade-off between query popularity and performance. But it’s good to at least set a minimum threshold of query performance for autocomplete suggestions.
Summary
Autocomplete isn’t just about helping searchers type less. It’s a way to guide searchers to successful search experiences. It’s easy to build a basic autocomplete system, but we’ve seen there’s a lot more to computing query probabilities and taking into account query performance. Building a robust autocomplete system can be a major endeavor.
But remember that, for a large fraction of your searchers, autocomplete is the entire search experience. Invest accordingly.
Previous: Taxonomies and Ontologies