How to Evaluate an AI Shopping Agent on Your Catalog: the Golden-Query Method

Jake Casto · 19 min read

Key Takeaways

  • Watching an AI shopping agent answer a handful of queries in a demo is not an evaluation. You need a fixed query set, a written rubric, and a baseline to compare against.
  • The Golden-Query Method is a repeatable four-stage loop: build a 100-query catalog-specific golden set, score it on a five-axis rubric, A/B the agent against your current search, and re-score after every change to catch regressions.
  • A good golden set is not random. It is sized by query volume (head, torso, tail) and intent type, and it includes out-of-catalog control queries the agent should not answer confidently.
  • Even frontier models fail most real shopping tasks in published benchmarks, so a confident demo tells you nothing about your catalog.
  • Re-score the same set after each change and gate on the tail buckets. A single aggregate score hides the failures that reach shoppers.

A few weeks ago I watched an AI shopping agent nail "red running shoes," then hand back a hiking boot for "trainers for flat feet under 100." Ninety seconds earlier the demo had looked flawless. That gap, between the query an agent rehearsed and the query a real shopper types, is the whole problem.

Every team I talk to is being asked the same thing right now. Should we trust an agent on our storefront? Almost everything published about it is a vendor pitch or an "agents are coming" think-piece. Nobody hands you a repeatable way to answer the one question that matters before you ship: does it actually work on my catalog?

This is that method. We call it the Golden-Query Method, and it is vendor-neutral on purpose. Run it on whatever you have today, on the agent someone is selling you, on Shopify native search, on us. It is the merchant-facing version of the discipline we run internally, which I wrote up in how we evaluate search quality at scale.

Why isn't a demo enough to evaluate an AI shopping agent?

Evaluating an AI shopping agent means scoring it against a fixed, catalog-specific query set with a written rubric, then comparing it to your current search and re-checking after every change. A demo is not an evaluation, because it only shows the queries the agent already handles, not the regional, attribute-heavy, and out-of-catalog queries that reveal failure.

Every shopping-agent demo looks great, because a demo is a rehearsal. Real catalogs break somewhere off the script: regional terms, stacked attributes, natural-language intent, and the queries your catalog genuinely cannot satisfy.

The research backs the stakes, and it is not close. On ShoppingBench, an intent-grounded benchmark built on 2.5 million real products, even frontier models clear under half of real shopping tasks. On ShoppingComp, a harder benchmark of compositional shopping scenarios, frontier models score in the teens.

Both run on someone else's sandbox, not your catalog. That is exactly why you cannot borrow their numbers. A confident answer is not a correct one, and the only way to know which you are getting is to score it on your own products.

None of this is new in ecommerce search; agents just raised the stakes. Baymard Institute has logged more than 700 search usability issues across multi-million-dollar sites, and NNGroup has tracked on-site search success rates over seventeen years of testing. Search quality has been a measurable, fixable problem for a long time.

The agent does not change that question. It answers more queries, faster, with more confidence, which is precisely why you need a fixed way to check it.

I learned this the hard way building our own evaluation. A result would come back looking perfectly reasonable on paper (the right category, the matching keywords, a clean-looking set of ten products), and the multimodal judge would still flag it as wrong because the lead image was obviously off from what the query asked for. You catch that when you score against a fixed set. Never in a demo.

The objection I hear most: "We'll just watch the conversion rate." Conversion moves slowly, and it is confounded by promotions, seasonality, traffic mix, and a dozen other things that have nothing to do with the agent you just shipped. By the time the number finally drops far enough to notice, the shoppers who hit the broken results are already gone. Pre-deployment scoring catches the break first, which is the entire point of having a method.

The demo gap

Demo query: "red running shoes" → correct product, every time. Real query: "trainers for flat feet under 100" → confidently wrong product. The second query is the one your shoppers actually type. It is also the one no demo rehearses.

What is a golden query set?

A golden query set is a fixed, curated list of real shopper queries you run against a search system every time you evaluate it, so results are comparable across runs and systems. It is engineered, not random. A random sample over-weights easy head queries and misses the tail, attribute-heavy, and out-of-catalog queries where agents fail.

The word "fixed" is doing the work here. Run the same queries every time and two runs are comparable, two systems are comparable, this quarter is comparable to last quarter, and a model you are being sold is comparable to the one you already have. Change the queries and you have thrown away your entire baseline.

A random sample of 100 search-log queries fails for a specific reason. It over-weights head queries, the ones every engine already gets right, and under-samples the tail and the edge cases where agents actually break.

So you measure the part that works and miss the part that doesn't. The set has to be engineered to cover the failure surface, not sampled to mirror traffic.

Three properties make a set "golden" rather than just a query list:

  • Curated and stable. Internally we call these seeded sets. The merchant-facing equivalent is your golden set: the same queries, run after run, never refreshed mid-evaluation.
  • Sized to see a pattern. A hundred queries in total puts enough in each bucket to spot a real failure mode, and is small enough to score by hand or with a judge in an afternoon.
  • Reusable forever. Once it is built, it becomes your standing regression suite, not a one-time audit you run and forget.

If you want the industrial version of this, the same head, torso, and tail bucketing drives our internal loop in how we evaluate search quality at scale.

How do I build a 100-query golden set for my catalog?

Build a 100-query golden set across two axes: volume tier (head, torso, tail) and intent type (exact-product, attribute-constrained, natural-language, comparison, synonym/regional, negation, branded, ambiguous, and out-of-catalog). Over-weight the tail, write acceptance criteria for each query, include 10 to 15 out-of-catalog controls, then freeze the set so every future run is comparable.

A defensible set covers two dimensions at once. Volume tells you whether a failure is loud or quiet. Intent type tells you which capability broke. Build the grid first, then fill each cell from your own search logs and catalog, pulling the real strings people type, the misspellings and the regional words and the half-formed natural-language questions, not the tidy queries you wish they typed.

  1. Split by volume: head, torso, tail. Pull your top search terms from analytics. Of the roughly 85 in-catalog queries, head queries (the top decile by frequency) get about 22, torso about 30, tail about 33. You over-weight the tail on purpose, because that is where agents quietly fail and where aggregate metrics hide the damage.
  2. Split by intent type within each tier. Cover the full surface: exact-product or SKU lookup, attribute-constrained ("waterproof hiking boots size 9"), natural-language intent ("something for a beach wedding"), comparison ("warmest winter jacket"), synonym and regional ("trainers" against "sneakers"), negation ("jacket, no logo"), branded, and ambiguous ("gold").
  3. Add 10 to 15 out-of-catalog controls as their own band. These are queries your catalog cannot satisfy, like "navy midi dress" when you carry none. They do not get a volume tier, because there is no head query for a product you don't sell. The correct behavior is a graceful "no match" or honestly-labeled near-alternatives, never a confident wrong product. This is the band almost every eval omits, and it is the one that catches hallucination.
  4. Write the expected outcome for each query. Not the exact product, the acceptance criteria. "Any in-stock waterproof boot in size 9." "Must respect 'no logo.'" "Should return zero results or clearly-labeled alternatives." This is what makes scoring reproducible across two different people.
  5. Freeze it. Save the set. From here it is your regression suite. Re-pull and refresh it quarterly as the catalog and demand shift, but never edit it in the middle of an evaluation, and re-score your baseline on the refreshed set so later runs still compare like with like.
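The five steps above can be captured in a small data structure. This is a minimal sketch, not a prescribed format: the `GoldenQuery` class, its field names, and the `check_allocation` helper are hypothetical, and the tier labels on the example queries are assumptions made for illustration. The target counts mirror the approximate split described above.

```python
from collections import Counter
from dataclasses import dataclass

# Approximate allocation from the steps above; adjust to your own catalog.
TARGET_ALLOCATION = {"head": 22, "torso": 30, "tail": 33, "out_of_catalog": 15}


@dataclass(frozen=True)  # frozen: a golden query is never edited mid-evaluation
class GoldenQuery:
    text: str        # the real string a shopper typed
    tier: str        # "head", "torso", "tail", or "out_of_catalog"
    intent: str      # e.g. "attribute", "negation", "synonym_regional"
    acceptance: str  # written acceptance criteria, not an exact product


def check_allocation(queries, target=TARGET_ALLOCATION):
    """Return {tier: (actual, wanted)} for every tier that misses its target."""
    counts = Counter(q.tier for q in queries)
    return {
        tier: (counts.get(tier, 0), wanted)
        for tier, wanted in target.items()
        if counts.get(tier, 0) != wanted
    }


# Tier assignments here are illustrative guesses, not real traffic data.
example = [
    GoldenQuery("waterproof hiking boots size 9", "torso", "attribute",
                "Any in-stock waterproof boot in size 9."),
    GoldenQuery("jacket, no logo", "tail", "negation",
                "Must respect 'no logo'."),
    GoldenQuery("navy midi dress", "out_of_catalog", "out_of_catalog",
                "Should return zero results or clearly-labeled alternatives."),
]

# With only three queries filled in, every tier is still short of its target.
print(check_allocation(example))
```

Running `check_allocation` while you fill the grid shows which cells are still thin before you freeze the set.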

Each intent type maps onto a capability you are testing. Synonym and regional queries test query expansion and abbreviation handling. Exact-product and SKU queries test typo correction and lookup. Attribute and natural-language queries test whether the engine reads meaning at all.

To pressure-test a cell, run it through the persona simulation and the Query Understanding panel in test text search.

The objection: "A hundred queries sounds like a lot of manual work." You build it once. It is reusable forever, and the Golden-Query Eval Template below ships the grid pre-structured, so the work is filling cells, not designing the framework.

The golden-set grid (100 queries)

Two bands. Band one holds the ~85 in-catalog queries, rows by intent type and columns by head, torso, and tail. Band two holds the ~15 out-of-catalog controls, untiered, highlighted on its own. The two bands sum to 100. This grid is the backbone of the downloadable template.

How do I score the results so two people agree?

Score each result set on five axes, each 0 to 3 with written anchors: relevance, intent match, attribute fidelity, conversion potential, and no-wrong-results. Anchored descriptions make scoring reproducible across people. For a quantitative view, convert per-product judgments into nDCG@10, Recall@10, and MRR, which let you compare two systems on the same golden set with one number per metric.

A rubric only works if two people score the same result the same way. That rules out a vague 1-to-10 scale, where everyone has a private definition of a 7. Each axis needs anchored descriptions.

And you score the result set, the top 10 to 20, not a single product. One great result buried under five wrong ones is still a bad search.

Five axes, each scored 0 to 3:

  1. Relevance (0 to 3). Do the results match the query topic? Zero is off-topic. Three is all on-topic and ranked sensibly.
  2. Intent match (0 to 3). Do the results match what the shopper was trying to do, not just the words they typed? "Beach wedding" should surface appropriate dresses, not everything tagged "beach."
  3. Attribute fidelity (0 to 3). Are the hard constraints respected: size, color, material, price, negation? A "no logo" query that returns logoed products scores 0 here, no matter how relevant the rest looks.
  4. Conversion potential (0 to 3). Would a real shopper actually buy from this set? In-stock, on-brand, sensibly priced, decent imagery. This is the merchandiser's axis, and it is the one pure relevance scoring ignores.
  5. No-wrong-results, the safety axis (0 to 3). Are there clearly wrong results near the top, and on the out-of-catalog controls, did the agent avoid a confident hallucinated answer? ShoppingComp formalizes product-safety hazards as an eval dimension; we borrow the principle that a confidently wrong answer is worse than an honest "no match."

For teams that want numbers, convert per-product judgments into the standard metrics the information-retrieval field uses. nDCG@10 asks whether the good results sit near the top. Recall@10 asks what fraction of your catalog's relevant products made the top 10. MRR (mean reciprocal rank) asks how far down the first good result lands, averaged across queries.
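For readers who want to see the arithmetic, here is a minimal sketch of the three metrics in plain Python. It assumes per-product judgments are stored as a hypothetical `grades` dict mapping product ID to a 0 to 3 grade, treats any grade above zero as relevant, and uses the linear-gain form of DCG, which is one common variant among several. The product IDs are made up.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain: each grade is discounted by log2(rank + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids, grades, k=10):
    """DCG of the returned order divided by DCG of the best possible order."""
    gains = [grades.get(pid, 0) for pid in ranked_ids]
    ideal_dcg = dcg_at_k(sorted(grades.values(), reverse=True), k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, grades, k=10):
    """Fraction of the judged-relevant products that appear in the top k."""
    relevant = {pid for pid, grade in grades.items() if grade > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked_ids[:k])) / len(relevant)


def reciprocal_rank(ranked_ids, grades):
    """1 / position of the first relevant result, or 0.0 if there is none."""
    for position, pid in enumerate(ranked_ids, start=1):
        if grades.get(pid, 0) > 0:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (ranked_ids, grades) pairs, one per golden query."""
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


# One query: "z" is relevant but was never returned, "a" and "c" are irrelevant.
grades = {"b": 3, "d": 1, "z": 2}
ranked = ["a", "b", "c", "d"]
print(ndcg_at_k(ranked, grades), recall_at_k(ranked, grades), reciprocal_rank(ranked, grades))
```

Averaging each metric over the golden set gives the one-number-per-metric view; keeping the per-query values lets you break the same numbers out by bucket.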

The rubric is the human-readable layer. The IR metrics are the comparable layer. That is exactly how our own system works, scoring against the signal weighting that ranking quality reflects.

Here is the part teams learn late. Add the conversion-potential axis first, not last. A search can be relevant and still surface things nobody buys, and the merchandiser is the only one in the room who catches it.

The five-axis rubric card

Five axes, each 0 to 3, each with a written anchor for what a 0 and a 3 look like. Relevance, intent match, attribute fidelity, conversion potential, no-wrong-results. The full anchored descriptions live in the downloadable template.
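If you keep scores in code rather than a spreadsheet, the rubric card can be represented roughly as follows. This is a sketch under assumptions: the `RubricScore` class and the `disagreements` helper are hypothetical names, and the example scores are invented. The helper lists the axes where two scorers differ on the same result set, which is a quick way to find anchors that need tighter wording.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RubricScore:
    """One scorer's judgment of one result set, each axis scored 0 to 3."""
    relevance: int
    intent_match: int
    attribute_fidelity: int
    conversion_potential: int
    no_wrong_results: int

    def __post_init__(self):
        # Reject anything outside the anchored 0-3 scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if value not in (0, 1, 2, 3):
                raise ValueError(f"{f.name} must be 0, 1, 2, or 3, got {value!r}")


def disagreements(a, b):
    """Axes where two scorers gave the same result set different scores."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


# Invented scores for a "jacket, no logo" result set that returned logoed products.
merchandiser = RubricScore(3, 2, 0, 2, 1)
technical_lead = RubricScore(3, 2, 0, 1, 1)
print(disagreements(merchandiser, technical_lead))
```

The sketch deliberately has no total: the comparison stays axis by axis, so a strong relevance score cannot mask a zero on attribute fidelity.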

How do I A/B test a shopping agent against Shopify native?

To A/B test a shopping agent against Shopify native, run the same frozen golden query set through both systems under identical conditions, blind the scorer to which system produced each result set, and compare scores axis by axis. For live traffic, assign by session ID so each shopper sees one system, and log the arm on every request so you can segment quality and revenue by arm.

Scoring one system tells you whether it is good. Comparing two tells you whether it is better, which is the question you are actually being asked. The harness runs the same frozen golden set through both arms, your candidate agent and your current or native search, scored under identical conditions.

Three rules keep the comparison fair:

  1. Same set, same conditions. Identical golden queries, the same catalog snapshot, the same personas and segments. Nothing varies except the search system itself.
  2. Blind the scorer. Whoever assigns the rubric scores, a person or an LLM judge, should not know which arm produced a given result set. That removes the "we want the new thing to win" bias. If you use a judge, strip the arm label out of its input.
  3. Deterministic, session-stable assignment for live traffic. When you graduate from offline scoring to a live split, assign by session ID so a shopper sees one consistent system across their visit, and log the arm on every search request so quality scores and revenue can be segmented by arm later.
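Rule 3 can be sketched with a hash of the session ID. This is one possible implementation, not a description of any particular platform: `assign_arm`, `log_search`, the salt string, and the 10% candidate share are all assumptions. It uses `hashlib` rather than Python's built-in `hash()`, because the built-in is randomized per process for strings and would not give stable assignments across servers or restarts.

```python
import hashlib


def assign_arm(session_id, candidate_share=0.1, salt="golden-query-ab-v1"):
    """Map a session ID to an arm; the same ID always lands in the same arm."""
    digest = hashlib.sha256(f"{salt}:{session_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return "candidate" if bucket < candidate_share else "native"


def log_search(session_id, query):
    """Build the log record for one search request, including its arm."""
    return {"session_id": session_id, "query": query, "arm": assign_arm(session_id)}


print(log_search("session-123", "trainers for flat feet under 100"))
```

Changing the salt reshuffles every session into a fresh assignment, which is useful when you start a new experiment and do not want it correlated with the last one.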

Order matters. Offline scoring on the golden set comes first, because it is cheap, fast, and safe. Only when the candidate wins offline do you route a small live traffic split and watch business outcomes. Both signals, the judge scores and the revenue, should point the same way before you graduate the candidate.

To see what each system actually returned per query, use revenue attribution and anomaly detection in insights and per-decision review in the audit log. The session-based assignment pattern is the same one I described in how we evaluate search quality at scale.

The objection: "We can't run two search systems at once." You don't have to. Offline scoring needs only the result outputs from each system, run separately against the same set. The live split comes later, and only once the offline winner is clear.

The A/B harness

One frozen golden set feeds two arms: candidate agent and native search. A blind scorer grades both on the five axes. Output is an axis-by-axis comparison, not a single winner-take-all number.
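The harness can be sketched in a few lines. The version below is illustrative only: `blind`, `compare_by_axis`, and the data shapes are hypothetical. It shuffles the result sets from both arms, replaces the arm label with an opaque item ID before anything reaches the scorer, and only joins the labels back on once scoring is finished.

```python
import random


def blind(results_by_arm, seed=0):
    """results_by_arm: {arm: {query: result_set}}.

    Returns (items, key): items carry no arm label and go to the scorer;
    key maps item_id -> arm and stays hidden until scoring is done.
    """
    rng = random.Random(seed)
    pairs = [(arm, query) for arm, per_query in results_by_arm.items() for query in per_query]
    rng.shuffle(pairs)  # so item order does not reveal the arm
    items, key = [], {}
    for n, (arm, query) in enumerate(pairs):
        item_id = f"item-{n:03d}"
        items.append({"item_id": item_id, "query": query,
                      "results": results_by_arm[arm][query]})
        key[item_id] = arm
    return items, key


def compare_by_axis(scores, key):
    """scores: {item_id: {axis: 0..3}}. Returns {axis: {arm: mean score}}."""
    sums = {}
    for item_id, axes in scores.items():
        arm = key[item_id]
        for axis, value in axes.items():
            total, count = sums.setdefault(axis, {}).get(arm, (0, 0))
            sums[axis][arm] = (total + value, count + 1)
    return {axis: {arm: total / count for arm, (total, count) in per_arm.items()}
            for axis, per_arm in sums.items()}


# Made-up result sets for two golden queries.
results_by_arm = {
    "candidate": {"jacket, no logo": ["p1", "p2"], "gold": ["p3"]},
    "native": {"jacket, no logo": ["p4", "p5"], "gold": ["p6"]},
}
items, key = blind(results_by_arm)
print([item["item_id"] for item in items])
```

The same `items` list works for a human scorer or an LLM judge, since neither ever sees the `key`.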

How do I catch search regressions before they reach shoppers?

Catch regressions by re-scoring the same frozen golden set after every change and diffing query by query, counting how many improved, regressed, or stayed flat past a score-delta threshold. Gate on the tail and branded buckets, not the aggregate, because a change can lift the head average while quietly degrading the tail queries that aggregate scores hide.

A golden set's real payoff is the second run. Any change at all, a model update, a synonym tweak, a re-index, an agent version bump, gets validated by re-scoring the same set and diffing against the prior run.

Three rules:

  1. Re-score the same set. Internally we call this the reuse pattern. Do not pull fresh queries, because that breaks comparability. Run the identical frozen set against the new configuration and compare query by query.
  2. Count improvements, regressions, and unchanged by a score-delta threshold. A change is not "better" because the average ticked up. Look at how many individual queries got worse, and by how much.
  3. Gate on the tail and branded buckets, not the aggregate. This is the failure mode that ships. A change lifts the head average while quietly degrading tail or branded queries, and the aggregate score hides it. We caught exactly this internally once: a weight change improved nDCG@10 overall while dropping tail-branded queries by 12 points. Break results out by bucket every time, and block any change that regresses a bucket past your threshold.
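The three rules translate into a small per-query diff plus a per-bucket gate. The sketch below is a hedged illustration: the function names, the thresholds, the bucket labels, and every score in the example are made up, and the per-query score could be nDCG@10 or any other metric you already compute. The example is built so the aggregate rises while one bucket drops, which is the failure mode rule 3 describes.

```python
from collections import defaultdict


def diff_runs(before, after, threshold=0.05):
    """Count queries that improved, regressed, or stayed within the threshold."""
    if before.keys() != after.keys():
        raise ValueError("Runs must use the identical frozen query set")
    counts = {"improved": 0, "regressed": 0, "unchanged": 0}
    for query, old in before.items():
        delta = after[query] - old
        if delta > threshold:
            counts["improved"] += 1
        elif delta < -threshold:
            counts["regressed"] += 1
        else:
            counts["unchanged"] += 1
    return counts


def bucket_means(scores, buckets):
    """Average the per-query scores within each bucket."""
    grouped = defaultdict(list)
    for query, score in scores.items():
        grouped[buckets[query]].append(score)
    return {bucket: sum(values) / len(values) for bucket, values in grouped.items()}


def failing_buckets(before, after, buckets, max_drop=0.02):
    """Return {bucket: delta} for every bucket whose mean fell by more than max_drop."""
    old, new = bucket_means(before, buckets), bucket_means(after, buckets)
    return {b: new[b] - old[b] for b in old if old[b] - new[b] > max_drop}


# Illustrative numbers only.
buckets = {"red running shoes": "head", "black leggings": "head",
           "brandx trail runner wide": "tail_branded",
           "trainers for flat feet under 100": "tail"}
before = {"red running shoes": 0.80, "black leggings": 0.75,
          "brandx trail runner wide": 0.70, "trainers for flat feet under 100": 0.60}
after = {"red running shoes": 0.95, "black leggings": 0.90,
         "brandx trail runner wide": 0.58, "trainers for flat feet under 100": 0.60}

print(diff_runs(before, after))
print(failing_buckets(before, after, buckets))  # non-empty means block the change
```

A deploy gate would block the change whenever `failing_buckets` returns anything, regardless of what the overall average did.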

Run this on a schedule, before every deployment and weekly as a watch, and the golden set stops being a one-time audit. It becomes a standing regression suite.

The tooling carries some of this for you. We replay affected queries against a proposed change and measure impact before it deploys, through a review-before-deploy lifecycle in search-quality optimizations, and insights flags anomalies like a sudden spike in zero-result searches. If a regression pushes queries to empty, that shows up as a zero-result rate jump first.

The reuse run is the part that earns the whole exercise. Same queries, new configuration, count what got worse. The first time it catches a regression you were about to ship, a synonym tweak that looked like a clean win in aggregate but was quietly gutting your branded tail, it pays back the afternoon you spent building the set ten times over.

Can I run this without engineering help?

Run the Golden-Query Method with nothing more than a spreadsheet and an afternoon: list the 100 queries, score each system on the five-axis rubric by hand, and compare. Scale later with an LLM judge if you want to run it weekly or past 100 queries. Built-in persona testing and per-decision inspection speed up scoring and explain why each result appeared.

The method does not require building infrastructure. The smallest version that works is a spreadsheet: the golden-set grid, the rubric with anchors, and two columns for "current search score" and "candidate agent score." Score by hand for the first run.

  • Manual first pass. One merchandiser and one technical lead, scoring blind, half a day. The merchandiser owns the queries and the conversion-potential calls; the technical lead owns the harness and the consistency.
  • Where built-in tooling helps. Testing as personas, the Query Understanding panel, and per-decision inspection let you see why a result happened, which speeds up both scoring and debugging. You can run the same approach against collections, similar products, and image search, not just text.
  • When to automate. Once the set is stable and you want it as a pre-deploy gate, move to an LLM judge over the frozen set, which is exactly what our internal loop does.

If you want the do-it-with-us path instead, book a demo and we'll score your current search against a candidate agent live, on your top queries.

What is a good search worth?

The payoff of evaluating before you ship is the conversion you don't lose. With our search, merchants see roughly +13% conversion rate and +14% revenue per visitor with stable average order value, and Rainbow Shops saw a +30% conversion lift after rebuilding their search stack. Those gains come from shipping search that works, and an evaluation like this is how you verify it in advance.

The gap between a search that works and one that doesn't is measured in conversion, not in vibes. I want to be precise about what these numbers are and are not. None of them come from running an eval. They come from shipping search that an eval like this would have validated first.

So the honest framing is this. The method does not move conversion. It is how you find out whether a change is one of the good ones before you bet conversion on it.

"Layers was the first time we were able to create some customization and essentially create the same kind of sort-orders that we were used to having in Salesforce."

David Cost, VP of eCommerce and Digital, Rainbow Shops

Run the method

Build the 100 queries once. Score what you have today. Then score anything you are asked to trust against the same set, every time. The demo will always look fine. The golden set is how you find out whether it actually is.

Want us to run it with you? Book a demo and we'll score your current search against a candidate agent on your own catalog. Rather start solo? Download the Golden-Query Eval Template and run your first pass this afternoon.

FAQs

1. How do I test if an AI shopping agent works on my catalog? Build a fixed, catalog-specific golden query set, score the agent against a written rubric, and compare it to your current search. The Golden-Query Method runs this as a four-stage loop: build 100 queries across volume tiers and intent types, score on five axes, A/B against your baseline, and re-score after every change. A demo only shows rehearsed queries, so it cannot tell you whether the agent works on yours.

2. What is a golden query set for ecommerce search? A golden query set is a fixed, curated list of real shopper queries you run against a search system every time you evaluate it, so results stay comparable across runs and systems. It is engineered, not randomly sampled: a random sample over-weights easy head queries and misses the tail, attribute-heavy, and out-of-catalog queries where agents actually fail.

3. How do I A/B test a shopping agent vs Shopify native? Run the same frozen golden set through both systems under identical conditions, with the same catalog snapshot and personas. Blind the scorer to which system produced each result set, then compare scores axis by axis. Score offline first, and only route a small live traffic split, assigned by session ID, once the candidate wins offline and you want to confirm business outcomes.

4. How many queries should a golden set have? A hundred is the working floor for a Shopify Plus catalog. That is enough per bucket to see a pattern, across head, torso, and tail volume tiers, the eight in-catalog intent types, and an out-of-catalog control band, and small enough to score by hand or with a judge in an afternoon. Over-weight the tail because that is where agents quietly fail.

5. What metrics should I use to evaluate ecommerce search? Use a five-axis rubric for human-readable scoring (relevance, intent match, attribute fidelity, conversion potential, no-wrong-results), then convert per-product judgments into the standard information-retrieval metrics for the comparable view: nDCG@10 (are good results near the top), Recall@10 (did the relevant products make the top 10), and MRR (how far down the first good result sits).

6. How do I catch search regressions before they reach shoppers? Re-score the same frozen golden set after every change and diff it query by query, counting how many queries improved, regressed, or stayed flat past a score-delta threshold. Gate on the tail and branded buckets rather than the aggregate, because a change can lift the head average while quietly degrading the tail queries that aggregate scores hide.

7. Why isn't a demo enough to evaluate an AI shopping agent? A demo runs the queries the agent already handles, so it always looks good. Real catalogs break on regional terms, stacked attributes, natural-language intent, and out-of-catalog queries that no demo rehearses. Published benchmarks like ShoppingBench and ShoppingComp show even frontier models fail most real shopping tasks, so a confident demo tells you nothing about your own products.

8. What are out-of-catalog control queries and why do they matter? Out-of-catalog controls are queries your catalog cannot satisfy, such as "navy midi dress" when you carry none. They matter because an agent that confidently hallucinates a wrong product is worse than one that says "no match." Include 10 to 15 in every golden set; the correct behavior is a graceful empty result or honestly-labeled alternatives, never a confident wrong answer.

Jake Casto · Founder, Layers

Jake Casto is the founder of Layers, the enterprise search and merchandising platform built for Shopify Plus. He previously co-founded Proton, a Shopify Plus engineering studio that shipped more than 400 storefronts, where Layers began as an internal tool for a problem that kept repeating. He writes about search infrastructure, performance, and the engineering behind discovery at scale.
