Learn to Rank = Learn to be Humble

Daniel Tunkelang
3 min read · Mar 24, 2024

As a search consultant, I work with lots of organizations that strive to improve their search applications. I tend to emphasize query understanding and content understanding as target-rich areas with high potential returns. But often their top priority is to improve the machine-learned ranking of results, also known as learning to rank (LTR).

I do what I can to help my clients improve their ranking. Much of that work involves developing a collection of ranking factors, roughly divided into query-dependent relevance factors and query-independent desirability factors. There is the matter of choosing a ranking model (e.g., XGBoost) that combines these factors. And there is the bigger challenge of collecting data from searcher behavior or human judgments to train the model.
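For illustration, here is a minimal sketch of that setup using XGBoost's ranking API. The feature names and the tiny hand-made dataset are hypothetical, just to show how query-dependent and query-independent factors end up as columns in the same feature matrix.

```python
# Minimal learning-to-rank sketch (hypothetical features and data).
# Assumes xgboost's scikit-learn ranking API (XGBRanker).
import numpy as np
import xgboost as xgb

# Each row is one (query, result) pair. The first two columns are
# query-dependent relevance factors; the last two are query-independent
# desirability factors.
X = np.array([
    # title_match, text_match, popularity, recency
    [0.9, 0.7, 0.3, 0.8],
    [0.4, 0.5, 0.9, 0.2],
    [0.1, 0.2, 0.6, 0.5],
    [0.8, 0.9, 0.1, 0.9],
    [0.3, 0.3, 0.7, 0.4],
])

# Graded relevance labels (e.g., derived from clicks or human judgments).
y = np.array([3, 1, 0, 2, 1])

# Group sizes: the first 3 rows belong to one query, the last 2 to another.
group = [3, 2]

model = xgb.XGBRanker(objective="rank:ndcg", n_estimators=50)
model.fit(X, y, group=group)

# At query time, score the candidate results for a single query and sort.
scores = model.predict(X[:3])
print(np.argsort(-scores))  # result indices in ranked order
```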

But my biggest challenge is managing expectations around the potential gains from investments in ranking.

Size matters. So does representativeness.

Many companies do not have massive amounts of behavioral data or human judgments to learn from. Even when they do have a decent quantity of traffic, it often has a Zipfian or power law distribution, in which case the majority of the traffic is concentrated in a relatively small set of unique queries. For many of those queries, engagement tends to be concentrated in a handful of top-ranked results. The same tends to be true for human judgment data, since the expense of human judgments usually leads to prioritizing labels for top-ranked results of frequently issued queries.

As a result, the quantity of training data often overstates its utility. For example, if 30% of the traffic is on the top 100 queries, and if the results for those head queries are already highly optimized (as is likely the case), then a ranking model is unlikely to learn anything new or useful from that traffic. In contrast, the model will learn more slowly about infrequent tail queries, and especially about low-ranking results for those queries.
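As a sanity check on your own data, a few lines of pandas will show how concentrated the traffic actually is. The file name and column name here are hypothetical; substitute your own query log.

```python
# Sketch: measure how much traffic the head queries account for.
# Assumes a query log with one row per search and a "query" column (hypothetical schema).
import pandas as pd

log = pd.read_csv("query_log.csv")  # hypothetical file
counts = log["query"].value_counts()

top_100_share = counts.head(100).sum() / counts.sum()
print(f"Top 100 queries account for {top_100_share:.0%} of traffic")

# Cumulative share curve: how many unique queries it takes to cover
# 50%, 80%, and 95% of traffic.
cum_share = counts.cumsum() / counts.sum()
for threshold in (0.5, 0.8, 0.95):
    n_queries = (cum_share < threshold).sum() + 1
    print(f"{threshold:.0%} of traffic comes from the top {n_queries} queries")
```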

To be clear, a model does not need to have seen a particular query — let alone a particular query-result pair — in its training data in order to have learned to score and rank it effectively. Indeed, the whole point of machine learning is to learn a model that generalizes beyond its training data.

However, the training data does need to be representative of the data to which the model will be applied. Specifically, the joint distribution of feature vectors and labels needs to be representative of the targeted traffic. And the concentration of traffic in the head of the distribution makes it likely that the model will have blind spots in the tail.

Learn to be humble.

Unfortunately, the tail is where there is often the most opportunity to improve ranking. After all, if there were a problem with a head query, it would probably have been fixed already, possibly even by hand.

What does this mean in terms of managing expectations? For me, it means communicating that the bottleneck for improving ranking is usually going to be the available training data — not just the quantity, but the alignment of its distribution with the distribution of opportunities for improvement. If the training data comes from searcher behavior, as is most often the case, then much of it will be concentrated in the head of the head — that is, in the top-ranked results for the most frequently issued queries.

Of course, it may still be possible and worthwhile to invest effort in improving ranking. But it requires attention to detail, particularly to how data collection is sampled or stratified to ensure adequate coverage across the traffic distribution (see the sketch below). And, since training data is likely to be the bottleneck, it is usually better to start with simpler models and fewer ranking factors before trying to train a complex model whose capacity far exceeds the amount of available signal.
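One way to act on that, sketched below, is to stratify queries by traffic frequency and sample a fixed number of queries for labeling from each stratum, rather than sampling proportionally to traffic. The bucket boundaries and sample sizes are hypothetical.

```python
# Sketch: stratified sampling of queries for labeling, so tail queries are
# not drowned out by head queries. Boundaries and sample sizes are hypothetical.
import pandas as pd

log = pd.read_csv("query_log.csv")  # hypothetical query log, one row per search
counts = log["query"].value_counts()

# Bucket queries by how often they occur: head, torso, tail.
buckets = pd.cut(counts, bins=[0, 10, 1000, float("inf")],
                 labels=["tail", "torso", "head"])

# Sample a fixed number of queries per bucket, rather than sampling
# proportionally to traffic (which would return mostly head queries).
samples = []
for bucket in ["head", "torso", "tail"]:
    in_bucket = counts[buckets == bucket]
    samples.append(in_bucket.sample(min(len(in_bucket), 200), random_state=42))

queries_to_label = pd.concat(samples)
print(queries_to_label.head())
```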

Even a highly talented team of machine learning engineers with sophisticated models and abundant compute resources has to face the reality that a model can only be as good as the data used to train it.

So, by all means, learn to rank. But first, learn to be humble.
