Search Query Understanding


The journey of a search query through e-commerce engineering stack can be broadly divided into following phases, search query text processing phase, retrieval phase where relevant products are fetched from indexer and the last but not the least, product re-ranking phase where a machine learning ranking engine re sorts the products primarily based on combination of KPIs like click through rate, add to cart rate, checkout rate etc. The focus of this post would be primarily on the first phase i.e. query text processing via a Query Understanding Service (QUS). I would be discussing the applications and working of QUS in e-commerce search. QUS is one of the most critical service needed to resolve user query and find the key search intent. Among the plethora of machine learning(ML) services working across the engineering stack in any e-commerce company, QUS is usually the first to hold the fort and acts as the backbone ML service in pre retrieval phase.

Common Issues

QUS will help move search beyond raw keyword match. In a world without QUS following issues can occur if we depend on pure SOLR retrieval

  1. Retrieval Sensitivity :

Search Phases

We can divide the e-commerce search process in two phases

Post Retrieval :

This phase is concerned with the retrieved relevant results corresponding to the query. It is here where we rank the results, add recommendation, add sponsored products to the final list of results.

Pre Retrieval :

This is the phase where we haven’t yet retrieved results from backend indexing engine yet. The only information we got to deal with is the raw user query. Usually this phase comprises of following components (can be separate microservices)

Query Classifiers

Category Classifier: This service would classify the query into leaf level categories corresponding to the catalog structure/taxonomy. The output of this service would be a set of leaf level categories. Solr can either filter the results on the basis of predicted set of categories or boost the results for products belonging to predicted category set.

Query Tagger

  1. resolving long tail queries
  2. query rewriting and query relaxation
  3. product promotion / recommendation
Architecture: Bidirectional LSTM-CRF Models for Sequence Tagging
Tutorial :
Global Attributes

Attribute Normalization

Once the tagger detects a mention in query and tags individual tokens the next step involves normalizing these tokens. For attributes types like color, normalization is pretty straight forward e.g. {red, light red, pink, .. etc} can be mapped to one color family with normalized name RED, similarly for price too we can create a standardized denomination using a set of regular expressions. With normalization we are aiming to standardized the attribute key/value pair w.r.t the values in catalog. Here the prior requirement is that products in catalog would have canonicalized values for attributes e.g. all men shirts would have size attribute mapped to only a predefined values {XS, S, M, L, Xl, XXL ..}. Now once we detect size attribute in a query like “mens small checked shirt”, the next step is to normalize the size token “small” to normalized attribute value in catalog “S”. This would help us in either making a faceted query to SOLR or boost products in retrieval where size attribute is “S”, thereby enhancing the retrieval quality.

  1. Rule based methods

Entity Linking Based Brand Normalization

Let’s say we have a mention string tagged as Brand in the search query. The entity linking task can be broken into two steps: candidate generation and candidate ranking. Candidate generation means fetching normalized brand candidates that are syntactically or semantically similar to the mention string, e.g for search query “paw-patrol fire truck”, the tagger would generate mention for Brand as “paw-patrol” and the candidate generation phase can find a set of syntactically similar brands from catalog for category Toys. Traditionally an information retrieval based approach for candidate generation has been used like BM25, a variant of TF-IDF to measure similarity between mention string and candidate brands and their description. Once we have a set of candidates we can rank them in the Candidate ranking phase.

Brand Resolution In A Noisy Catalog

It may happen that brand names are not normalized in the catalog. In this case some brand e.g. Coca-Cola can be referred by different products in catalog using different variants e.g. coke, coca cola, coca-cola, coca + cola etc. Here we can’t normalize brand in query since brand names in catalog aren’t normalized. So instead of canonicalizing the brand in query we should aim to fetch products that refer to any variant of the searched brand.

a. Trigram-Jaccard

b. Levenshtein Edit Distance

c. JaroWinkler

d. FuzzyWuzzy


Connecting the Dots



Machine Learning Tech Lead Amazon

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store