How We Evaluate Search Quality at Scale With LLM Judging and IR Metrics

Search is easy to ship, but hard to know when you’ve broken it.

You can change a ranking model, deploy it, and have no idea whether it made results better or worse. Conversion rate takes days to move. Click-through rate is noisy. And asking someone to manually review results for hundreds of queries is the kind of thing that happens once, never gets maintained, and tells you nothing about what changed between this week and last.

When we started building Layers, we knew we needed a rigorous way to measure search quality that could keep pace with our rapid iteration on ranking models. What we built is an end-to-end evaluation loop that combines LLM judging, classical IR metrics, A/B experimentation, and our own analytics query language to close the feedback cycle between a model change and a reliable signal about whether it actually helped.

Why E-commerce Search Is Hard to Evaluate

Most search evaluation literature assumes you have relevance labels, which are human annotations that say “this result is good for this query.” At scale, that’s expensive and slow. User signals like clicks and purchases are the natural proxy, but they’re lagged, sparse for tail queries, and confounded by position bias.

E-commerce search has a few additional complications.

The catalog is the constraint. A query for “navy blue midi dress” might have zero relevant results, not because the search is broken, but because the merchant doesn’t carry it. A judge who doesn’t understand the catalog will score that as a failure. It’s not.

Intent is messy. A query like “gold” on a jewelry site means something completely different than “gold” on a workwear site. Branded terms, attribute-heavy queries, negation (“no leather”), and ambiguous intent all behave differently and need to be evaluated differently.

Head queries aren’t the whole story. The top 10 queries for any given store might drive 40% of search sessions, but the tail is enormous. A model optimized for head queries at the expense of tail quality is a common failure mode that aggregate metrics hide completely.

These constraints shaped every decision we made about how to build our evaluation system.

The Evaluation Pipeline

When I trigger an evaluation run, the system replays real user queries against the live search engine and has an LLM judge score the results. Here’s how that actually works.

Sourcing Queries

Queries come from one of two places. The first is real traffic: we sample the most frequent search terms from our analytics data over a configurable date window, ranging from 1 to 90 days, with a cap of 1,000 queries per run. The second is seeded sets, which are curated query lists we’ve pre-loaded for specific evaluation scenarios, such as testing attribute-heavy queries or branded searches in isolation.
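As a rough illustration of the real-traffic path, here is a minimal Python sketch. The 1 to 90 day window and the 1,000-query cap come from the description above; the event shape, the function name, and the lowercase/trim normalization are assumptions made for the example, not our production code.

```python
from collections import Counter
from datetime import date, timedelta

MAX_QUERIES_PER_RUN = 1000                # cap described in the text
MIN_WINDOW_DAYS, MAX_WINDOW_DAYS = 1, 90  # window bounds described in the text


def sample_top_queries(events, today, window_days, limit=MAX_QUERIES_PER_RUN):
    """Return the most frequent (query, count) pairs inside the date window.

    `events` is a hypothetical iterable of (search_date, raw_query) tuples.
    """
    if not MIN_WINDOW_DAYS <= window_days <= MAX_WINDOW_DAYS:
        raise ValueError("window_days must be between 1 and 90")
    limit = min(limit, MAX_QUERIES_PER_RUN)
    cutoff = today - timedelta(days=window_days)
    # Assumption: queries are grouped after trimming and lowercasing.
    counts = Counter(
        query.strip().lower()
        for day, query in events
        if cutoff < day <= today and query.strip()
    )
    return counts.most_common(limit)


events = [
    (date(2024, 5, 10), "Gold Ring"),
    (date(2024, 5, 9), "gold ring "),
    (date(2024, 5, 9), "midi dress"),
    (date(2024, 1, 1), "old query"),  # outside a 7-day window
]
top = sample_top_queries(events, today=date(2024, 5, 10), window_days=7)
```

The counts returned here are also what the volume-tier classification in the next step needs.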

Classifying Queries Before Judging

Before we send anything to the judge, every query gets classified along two axes.

The first is the volume tier, determined locally using frequency ratios relative to the top query in the set. Head queries sit at 10% or more of the maximum frequency. Torso queries range from 1% to 10%. Tail queries are everything below 1%.
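The tiering rule is simple enough to sketch in a few lines of Python. The thresholds are the ones stated above; the function names and the input shape (a dict of query to count) are illustrative assumptions. Integer arithmetic keeps the 10% and 1% boundaries exact.

```python
def volume_tier(count, max_count):
    """Classify one query by its frequency relative to the top query."""
    if count * 10 >= max_count:    # 10% or more of the maximum frequency
        return "head"
    if count * 100 >= max_count:   # at least 1% but below 10%
        return "torso"
    return "tail"                  # everything below 1%


def classify_volume(query_counts):
    """Map each query in a {query: count} dict to its volume tier."""
    if not query_counts:
        return {}
    max_count = max(query_counts.values())
    return {q: volume_tier(c, max_count) for q, c in query_counts.items()}


tiers = classify_volume({"boots": 1000, "rain boots": 100, "steel toe": 50,
                         "wide fit": 10, "size 14 wellies": 9})
```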

The second is the semantic type, determined by an LLM in batches. Each query is labeled as branded or generic, along with any applicable modifier flags: negative intent, attribute-heavy, ambiguous intent, or synonym-heavy.

This classification matters because a ranking model can be excellent on head-branded queries and quietly terrible on tail attribute-heavy ones. Aggregate scores hide that. Bucket breakdowns reveal it.

Running the Search + LLM Judge

Each query runs through the actual search pipeline, the same code path a real user hits, and then the top 20 results get sent to an LLM judge running on AWS Bedrock.

The judge is multimodal. It pulls both text metadata (title, description, product type, tags) and product images from Shopify CDN thumbnails. Product imagery carries a signal that text doesn’t. A query for “casual” means something different visually than it does semantically, and a result that looks wildly off-brand is a worse result than a text-only judge would recognize.

The judge also receives catalog context: a description of the store, its product types, and its brands. This is what lets it distinguish “bad results” from “query outside this catalog.”

The judge scores each result set across six dimensions, all on a 0 to 100 scale. Relevance measures whether the results match the query topic. Intent measures whether the results match what the user intended to do. Attribute measures whether constraints like color, size, or material are respected. Brand measures whether branded queries return the correct brand. Negative measures whether there are clearly wrong results in the top positions. Diversity measures the appropriateness of the variety in the result set.

In addition to those six scores, the judge assigns per-product relevance grades on a 0–3 scale. Those grades are what feed the IR metrics.

IR Metrics

From the per-product grades, we compute four standard information retrieval metrics.

NDCG@10, or Normalized Discounted Cumulative Gain, is the primary one I watch. It measures not just whether relevant results appear, but whether they appear near the top, applying a logarithmic discount for lower positions. Ranking a great product third instead of first is a measurable loss, not a rounding error.
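For readers who want the arithmetic, here is a small Python sketch of NDCG@10 over the judge’s 0–3 grades. Two choices here are assumptions rather than a description of our implementation: the exponential gain (2^grade − 1), which is one common convention, and computing the ideal ordering from the judged results only.

```python
import math


def dcg(grades, k=10):
    """Discounted cumulative gain over the first k graded results."""
    return sum(
        (2 ** grade - 1) / math.log2(position + 2)  # position 0 -> log2(2) = 1
        for position, grade in enumerate(grades[:k])
    )


def ndcg(grades, k=10):
    """DCG divided by the DCG of the same grades sorted best-first."""
    ideal = dcg(sorted(grades, reverse=True), k)
    return dcg(grades, k) / ideal if ideal > 0 else 0.0


perfect = ndcg([3, 2, 1, 0])      # already in ideal order
great_in_third = ndcg([0, 0, 3])  # the only great product sits in position 3
```

In the second case the single grade-3 product is discounted by log2(4) = 2, so the score is half of what it would be with that product in first place.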

MRR, or Mean Reciprocal Rank, tells us how far down the list a user has to scroll to find the first genuinely relevant result. Recall@10 tells us how many of the relevant products in the catalog actually made it into the top 10. Estimated CTR@10 is a position-weighted click probability estimate based on known click curves.
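The other three metrics can be sketched the same way. Several details below are assumptions for illustration only: treating a grade of 2 or higher as “genuinely relevant,” estimating the relevant set from the judged results when no catalog-wide count is supplied, and the click curve itself, whose values are made-up placeholders rather than the curve we use.

```python
RELEVANT_GRADE = 2  # assumption: grades 2 and 3 count as relevant
# Placeholder per-position click probabilities; illustrative values only.
PLACEHOLDER_CLICK_CURVE = [0.30, 0.15, 0.10, 0.07, 0.05, 0.04, 0.03, 0.03, 0.02, 0.02]


def reciprocal_rank(grades):
    """1 / position of the first relevant result, or 0 if there is none."""
    for position, grade in enumerate(grades, start=1):
        if grade >= RELEVANT_GRADE:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(grades_per_query):
    """Average reciprocal rank across a list of per-query grade lists."""
    if not grades_per_query:
        return 0.0
    return sum(reciprocal_rank(g) for g in grades_per_query) / len(grades_per_query)


def recall_at_k(grades, k=10, total_relevant=None):
    """Share of relevant products that made it into the top k."""
    if total_relevant is None:
        # Assumption: fall back to the relevant items among the judged results.
        total_relevant = sum(g >= RELEVANT_GRADE for g in grades)
    if total_relevant == 0:
        return 0.0
    return sum(g >= RELEVANT_GRADE for g in grades[:k]) / total_relevant


def estimated_ctr(grades, curve=PLACEHOLDER_CLICK_CURVE, max_grade=3):
    """Position-weighted click estimate: curve value scaled by grade / max_grade."""
    return sum(p * (g / max_grade) for p, g in zip(curve, grades))
```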

Finalization and Report

Once all query jobs are complete, the system aggregates everything into a report. That includes average scores across all six judge dimensions, breakdowns by bucket (head, torso, tail, branded, generic, and modifier type), a histogram of score distributions, and the top and bottom-scoring queries. The bottom list is the most useful one.
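A simplified Python sketch of the bucket aggregation is below. The record fields (`tier`, `brand_type`, `modifiers`, `ndcg10`) and the sample values are hypothetical; the idea is only that each query contributes its score to every bucket it belongs to, and that the bottom list is a plain sort.

```python
from collections import defaultdict
from statistics import mean


def bucket_averages(results, metric="ndcg10"):
    """Average a metric per bucket; a query counts in every bucket it carries."""
    groups = defaultdict(list)
    for r in results:
        groups["all"].append(r[metric])
        for label in (r["tier"], r["brand_type"], *r["modifiers"]):
            groups[label].append(r[metric])
    return {label: mean(values) for label, values in groups.items()}


def bottom_queries(results, metric="ndcg10", n=10):
    """Lowest-scoring queries first."""
    return sorted(results, key=lambda r: r[metric])[:n]


results = [
    {"query": "acme boots", "tier": "head", "brand_type": "branded",
     "modifiers": [], "ndcg10": 0.9},
    {"query": "dress", "tier": "head", "brand_type": "generic",
     "modifiers": ["ambiguous_intent"], "ndcg10": 0.7},
    {"query": "red linen shirt xl", "tier": "tail", "brand_type": "generic",
     "modifiers": ["attribute_heavy"], "ndcg10": 0.2},
]
averages = bucket_averages(results)
worst = bottom_queries(results, n=1)
```

With these sample values the overall average is 0.6, which looks tolerable, while the tail bucket sits at 0.2.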

There’s also an LLM-generated summary of strengths, weaknesses, and areas to investigate. That summary isn’t a replacement for reading the data. It’s a starting point, useful for surfacing patterns that aren’t obvious from aggregate numbers, like “tail attribute-heavy queries are underperforming because the model underweights exact keyword matches relative to semantic similarity.”

How the Ranking Model Is Defined

To understand what the evaluation is measuring, it helps to understand what it’s evaluating against.

Ranking models at Layers are defined as versioned JSON files. Each file declares a set of features, along with their weights, transforms, and normalization options. A simplified excerpt looks like this:

{
  "name": "default",
  "version": 1,
  "features": [
    { "feature": "text_similarity", "weight": 0.18 },
    { "feature": "image_similarity", "weight": 0.17 },
    { "feature": "keyword:title:bm25", "weight": 0.1, "normalize": "minmax" },
    { "feature": "product_metric:views_7d", "weight": 0.05, "transform": "sigmoid" },
    { "feature": "published_at", "weight": 0.1, "transform": "decay_days" }
  ],
  "options": { "normalization": "minmax", "combiner": "linear", "scoreFilterStd": 3.0 }
}

Features can include embeddings (text or image similarity), keyword signals, behavioral signals such as views, sales, and purchase rate across various time windows, or product properties such as recency. A model compiler takes these definitions and produces the final hybrid score.
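To make the “linear” combiner concrete, here is a toy Python sketch of what a compiler for such a definition might do. It is not our compiler. The `decay_days` formula (an exponential decay with a 30-day half-life on age in days) is a guess made up for the example, `scoreFilterStd` is ignored, and the candidate shape (a dict of feature name to raw value) is hypothetical.

```python
import math


def minmax(values):
    """Scale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


TRANSFORMS = {
    "sigmoid": lambda x: 1.0 / (1.0 + math.exp(-x)),
    # Assumed semantics: input is age in days, halving every 30 days.
    "decay_days": lambda age_days: 0.5 ** (age_days / 30.0),
}


def score_candidates(model, candidates):
    """Weighted sum of transformed, optionally normalized feature columns."""
    if not candidates:
        return []
    default_norm = model.get("options", {}).get("normalization")
    totals = [0.0] * len(candidates)
    for spec in model["features"]:
        column = [c.get(spec["feature"], 0.0) for c in candidates]
        transform = TRANSFORMS.get(spec.get("transform"))
        if transform is not None:
            column = [transform(v) for v in column]
        # A per-feature "normalize" overrides the model-level default.
        if spec.get("normalize", default_norm) == "minmax":
            column = minmax(column)
        for i, value in enumerate(column):
            totals[i] += spec["weight"] * value
    return totals


model = {
    "features": [
        {"feature": "text_similarity", "weight": 0.5},
        {"feature": "keyword:title:bm25", "weight": 0.5, "normalize": "minmax"},
    ],
    "options": {"combiner": "linear"},
}
candidates = [
    {"text_similarity": 0.8, "keyword:title:bm25": 12.0},
    {"text_similarity": 0.4, "keyword:title:bm25": 2.0},
]
scores = score_candidates(model, candidates)
```

Normalizing the BM25 column matters here because raw keyword scores are unbounded and would otherwise swamp the similarity features regardless of their weights.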

Models compose into pipelines. A pipeline specifies the candidate generation strategy, which ranking model to apply, and an optional second-pass rerank model. That separation matters because candidate generation and ranking are distinct failure modes with different fixes.

A/B Experimentation

Changing a feature weight and running an evaluation tells you how the model performs against an LLM judge. That’s necessary but not sufficient. You also need to know what happens with real users.

Experiments at Layers are defined as JSON files alongside the models. Each experiment specifies a baseline model, a candidate model, a traffic split percentage, and which stores it’s active on. The assignment is deterministic based on the session ID, so a user sees consistent results across their session, and we can reproduce exactly which model served any given search request.
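One common way to get a deterministic split is to hash the session ID into a bucket from 0 to 99 and compare it with the split percentage. The Python sketch below shows that general technique; the choice of SHA-256, the salting with the experiment ID, and the function name are assumptions for the example, not a description of our assigner.

```python
import hashlib


def assign_group(session_id, experiment_id, candidate_percent):
    """Return "candidate" or "baseline" for a session, deterministically."""
    # Salting with the experiment ID keeps splits independent across experiments.
    key = f"{experiment_id}:{session_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # 0..99
    return "candidate" if bucket < candidate_percent else "baseline"


first = assign_group("session-123", "exp-weights-v2", 20)
again = assign_group("session-123", "exp-weights-v2", 20)
```

Because the group is a pure function of the two IDs, the same call can be replayed later to reconstruct which arm served a logged request.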

Every search request is logged with the experiment ID and experiment group fields, which means the evaluation and analytics systems are unified. You can measure LLM judge scores by experiment group and business outcomes by experiment group using the same query language.

LayersQL: Closing the Loop with Business Metrics

LLM judge scores and IR metrics tell you about search quality. Business metrics tell you about business impact. We built LayersQL, our own analytics query language derived from ShopifyQL, to make it easy to query search data without digging into the underlying schema.

A typical experiment analysis looks like this:

FROM search_text
SHOW SUM(total_sales), SUM(quantity_purchased), AVG(click_rate)
GROUP BY experiment_group
SINCE -30d

The syntax will feel familiar to anyone who has used ShopifyQL. Experiment ID and experiment group are first-class dimensions in every dataset, so any analytics query can be segmented by experiment arm without any extra work.

How It All Fits Together

The full loop looks like this. We define a ranking model in JSON and create an experiment that routes some percentage of traffic to it. The experiment assigner deterministically splits sessions between baseline and candidate, logging the assignment on every search request. We run an auto-evaluation, sourcing real queries from our analytics data, classifying them, executing the search pipeline, and having the LLM judge score the results with full catalog context. The evaluation report provides NDCG@10, MRR, and the six judge dimensions, broken down by query bucket. We then run a LayersQL query to check whether the candidate group is seeing different business outcomes in terms of revenue, units, and click rate. If both signals point in the same direction, we proceed with the experiment. If they diverge, we investigate why.

One feature worth calling out is the “reuse” evaluation mode. Instead of re-fetching queries, it re-evaluates the same query set from a previous run against a new model configuration. That makes comparison clean: same queries, different ranking, comparable scores. The comparison report compares them query-by-query, counting improvements, regressions, and unchanged queries based on score-delta thresholds.
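The query-by-query comparison reduces to a thresholded delta. A minimal Python sketch follows, assuming each run is a dict of query to score; the 0.05 threshold and the names are illustrative rather than our actual settings.

```python
def compare_runs(baseline, candidate, threshold=0.05):
    """Bucket shared queries into improved / regressed / unchanged by score delta."""
    summary = {"improved": [], "regressed": [], "unchanged": []}
    # Only queries present in both runs are comparable.
    for query in sorted(baseline.keys() & candidate.keys()):
        delta = candidate[query] - baseline[query]
        if delta > threshold:
            summary["improved"].append(query)
        elif delta < -threshold:
            summary["regressed"].append(query)
        else:
            summary["unchanged"].append(query)
    return summary


baseline_scores = {"gold hoops": 0.50, "no leather belt": 0.70, "midi dress": 0.60}
candidate_scores = {"gold hoops": 0.60, "no leather belt": 0.60, "midi dress": 0.61}
report = compare_runs(baseline_scores, candidate_scores)
```

The threshold exists so that small score movements, which an LLM judge can produce between runs, are not counted as wins or losses.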

What This Has Changed

Before we had this system, ranking model changes were essentially done in the dark. You’d make a change, watch the conversion rate for a week, and try to infer whether it helped or was just noise. The iteration loop was slow, and the signal was unreliable.

Now we can run an evaluation, get NDCG and judge dimension scores broken down by query type, compare them against a baseline, and have a reasonably confident signal before we touch live traffic. The experiments layer means we can ship to a traffic split first and graduate only when both the evaluation and business signals agree.

The multimodal judge has been the biggest surprise in practice. Some product results look fine on paper (right category, matching keywords), yet the judge correctly flags them as wrong because the product image is obviously misaligned with what the query was asking for. Text-only evaluation would miss that entirely.

The bucket breakdowns have been the second-most valuable part. Tail query performance is invisible in aggregate metrics. Breaking results out by head, torso, and tail has surfaced model behaviors we would have shipped without noticing, including one case where a weight change that improved NDCG@10 overall was quietly degrading tail-branded queries by 12 points.

Search evaluation is one of those infrastructure investments that doesn’t feel urgent until you’ve made a bad deployment and spent a week figuring out why conversion dropped. We’re glad we built it early.