Language Identification
Queries are the primary way that searchers communicate with search engines. But it’s difficult to get started with query understanding until the search engine identifies the language in which the query is written.
History
Automatic language identification has seen several decades of work, going back to research in the 1970s on identifying the language of written or spoken text from the frequencies of letter combinations or sound segments. Work on statistical analysis to determine language and authorship goes back even further, to Udny Yule, George Zipf, and Jean-Baptiste Estoup.
Given the importance that language plays in society, it’s not surprising that this topic has been the object of sustained study, particularly among linguists and statisticians. Most of that study predates modern search engines.
Search Queries are Short
In general, statistical approaches to language identification take advantage of distributional characteristics of the text they analyze. As a result, the accuracy of these approaches is a function of the amount of text they analyze. As Arjen Poutsma showed in his 2001 paper on “Applying Monte Carlo Techniques to Language Identification”, a wide variety of language identification algorithms perform poorly for inputs of fewer than 50 characters.
This result is unsurprising: it’s difficult to obtain a meaningful distribution from a short text sample. Unfortunately, search queries tend to be short.
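To make the problem concrete, here is a minimal sketch of character-trigram language identification, the kind of distributional approach described above. The tiny inline training samples, the add-one smoothing, and the function names are purely illustrative assumptions, not any of the algorithms Poutsma evaluated; the point is only that a short query gives the model very little distributional evidence to work with.

```python
# A minimal sketch of character-trigram language identification.
# The inline "training" text is a placeholder, not a real language model.
from collections import Counter
import math


def trigrams(text):
    text = f"  {text.lower()} "  # pad so word boundaries appear in trigrams
    return [text[i:i + 3] for i in range(len(text) - 2)]


def train(samples):
    """Build a character-trigram frequency table per language."""
    return {lang: Counter(trigrams(text)) for lang, text in samples.items()}


def score(query, model):
    """Smoothed log-likelihood of the query under each language's trigram counts."""
    scores = {}
    for lang, counts in model.items():
        total = sum(counts.values())
        vocab = len(counts) + 1
        scores[lang] = sum(
            math.log((counts[t] + 1) / (total + vocab))  # add-one smoothing
            for t in trigrams(query)
        )
    return scores


# Toy training text; a real system would estimate these counts from large corpora.
model = train({
    "en": "the quick brown fox jumps over the lazy dog and runs away",
    "fr": "le renard brun saute par dessus le chien paresseux et s'enfuit",
})

# A longer input accumulates many trigrams of evidence; a one-word query like
# "train" (a valid word in both English and French) offers almost none.
print(score("where is the nearest train station", model))
print(score("train", model))
```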
Using Signals Beyond the Query
In 2009, Hakan Ceylan and Yookyung Kim wrote a paper on “Language Identification of Search Engine Queries”. In it, they described how they used clickthrough logs from Yahoo’s search engine to identify the language of queries based on the language of results searchers clicked for those queries.
Their approach took advantage of the relative ease of identifying the language of longer documents: in their data, the average search result contained almost 500 words, far more than the average search query. Even a single result could provide a robust signal, and most search queries were associated with clicks on multiple results.
They also noted that the language of the query terms might not be the same as the desired language of the results. For example, someone who searched for homo sapiens was probably not looking for documents written in Latin. To address these kinds of queries, they introduced a non-linguistic signal: the country from which the searcher was accessing the search engine.
But geography, while useful, cannot be trusted as the only signal. For example, Ceylan and Kim found that a quarter of the English queries in their data came from countries whose primary language was not English.
Hence, Ceylan and Kim approached language identification as a machine learning problem, using a combination of query features, document features, and the searcher’s country. Their best classifier achieved an average of 94.5% accuracy across 10 European languages.
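The sketch below is a hypothetical illustration of blending the three signal families Ceylan and Kim describe: a per-language score from the query text, the languages of clicked results, and a country-based prior. The function name, the weights, and the simple linear blend are my own assumptions for illustration, not their actual trained classifier.

```python
# A hypothetical sketch of combining query-text, click, and geographic signals
# into a single per-language score. Weights and structure are illustrative only.
from collections import Counter


def combine_signals(query_text_scores, clicked_doc_langs, country_lang_prior,
                    weights=(0.4, 0.4, 0.2)):
    """Blend per-language evidence from three sources into one ranked list.

    query_text_scores:  {lang: score in [0, 1]} from a text classifier
    clicked_doc_langs:  list of languages detected in clicked results
    country_lang_prior: {lang: prior in [0, 1]} for the searcher's country
    """
    w_text, w_click, w_geo = weights
    click_counts = Counter(clicked_doc_langs)
    total_clicks = sum(click_counts.values()) or 1

    langs = set(query_text_scores) | set(click_counts) | set(country_lang_prior)
    return sorted(
        (
            (
                w_text * query_text_scores.get(lang, 0.0)
                + w_click * click_counts.get(lang, 0) / total_clicks
                + w_geo * country_lang_prior.get(lang, 0.0),
                lang,
            )
            for lang in langs
        ),
        reverse=True,
    )


# Example: the query text looks Latin (think "homo sapiens"), but the clicks
# and the searcher's country both point to English, so English wins.
ranked = combine_signals(
    query_text_scores={"la": 0.7, "en": 0.3},
    clicked_doc_langs=["en", "en", "en"],
    country_lang_prior={"en": 0.9},
)
print(ranked)
```

A linear blend like this is only a stand-in for a learned model, but it captures the core idea: no single signal is trusted on its own, and the click and geography signals can override a misleading query string.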
When in Doubt, Ask!
No language identification approach is perfect. Even an accuracy in the mid 90s means that there will be frequent errors. Besides, it’s never a good idea to rely entirely on algorithmic approaches when it’s easy to ask the searcher for clarification.
If your search engine serves a multilingual population, make it easy for the searcher to specify his or her language. Don’t hide this functionality behind an undiscoverable advanced search page. And if your algorithmic language identification exceeds some uncertainty threshold, offer the searcher a clarification dialogue to specify the desired language.
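As a rough illustration of that last point, the sketch below applies a confidence threshold to the classifier's per-language probabilities and falls back to asking the searcher when no language clears it. The threshold value and the function name are assumptions for illustration, not a prescribed setting.

```python
# A minimal sketch of the "ask when uncertain" policy: guess only when the top
# language is confident enough; otherwise signal the UI to ask the searcher.
def resolve_language(lang_probs, threshold=0.8):
    """Return a language code, or None to signal that the UI should ask."""
    best_lang, best_prob = max(lang_probs.items(), key=lambda kv: kv[1])
    if best_prob >= threshold:
        return best_lang
    return None  # uncertain: surface a language-clarification dialogue


choice = resolve_language({"en": 0.55, "es": 0.45})
if choice is None:
    print("Show the searcher a 'Which language did you mean?' prompt")
else:
    print(f"Proceed with query understanding in {choice}")
```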
Further Reading
If you’d like to learn more about this topic, I recommend the 2014 “Survey of Language Identification Techniques and Applications” by Archana Garg, Vishal Gupta, and Manish Jindal. It discusses current work in this area and summarizes the accuracy of various methods. It’s a well-written overview and serves as a useful annotated bibliography.
Previous: Introduction
Next: Character Filtering