The retail merchandising stack is being rewritten by LLMs — quietly

9 min read · March 2026

The most consequential shift in retail technology right now isn't happening on the storefront. It is happening one layer down, in merchandising. The storefronts our clients are shipping in 2026 are recognizably the same storefronts they shipped in 2023 — slightly faster, slightly more personalized, slightly cleaner on mobile. The merchandising systems behind them are not recognizable at all.

Assortment, search, and pricing — the three engines that actually decide what a shopper sees and what it costs — have been governed for a decade by rules engines and hand-tuned heuristics. Those systems are quietly being replaced. Not by a single model, and not by the LLM the press cycle is focused on. They are being replaced by a stack of learned components glued together with LLMs handling the messy parts: attribute extraction from supplier feeds, query understanding, competitor parsing, exception explanation, and the long tail of edge cases that rules could never cover cleanly.

The shift is quiet because the user-visible surface barely changes. A category page still shows products. A search box still returns results. A price tag still displays a number. What has changed is the operating model behind each of those surfaces, and the operating model is where the durable economics live.

In assortment, the change is from category managers writing rules to category managers reviewing model output. For ten years, the job of an assortment planner was largely the job of authoring constraints — promote this brand in this region, suppress this SKU in these stores, never show this product above this one. The constraints accumulated. New planners inherited rule sets they did not write and could not safely modify. The system became unauditable in practice, even when it was auditable in theory.

The new shape of the job is different. A learned ranker proposes the assortment for a given context. The planner reviews exceptions, curates the training signal, and adjudicates the cases where the model is confidently wrong. The merchant's judgment still matters — arguably more, because it is now encoded as labeled data instead of as opaque rules. Teams that have made this transition consistently report two outcomes: the backlog of pending rule changes collapses, and the response time to a real demand shift drops from weeks to days.

The transition is not free. The hardest part is not the model. The hardest part is convincing a merchandising organization that has measured its own value by the volume of rules authored that the new measure of value is the quality of the exceptions caught. That is a compensation conversation, an org-design conversation, and a change-management conversation before it is a machine-learning conversation. The retailers who treat it as primarily a technical problem are the ones still stuck.

Search is the most visible shift, and the easiest to underestimate. Semantic retrieval, often with an LLM-rewritten query in front of it, is displacing the keyword-and-synonym dictionaries that ranking teams have maintained for years. The wins are not just on long-tail queries — though those wins are real and immediate. Head queries also improve, because the model understands intent in a way that synonym lists never did. A query like 'gift for a runner' used to fall through to a generic fallback. It now returns a curated set that reads as if a human merchandiser put it together.

Cost is the new constraint, and the cost curve is dropping fast. Eighteen months ago, running an LLM in the search path at head-query volume was an unacceptable line item. Today it is a budgeting question, not an architecture question. The teams that built their search stack on the assumption that the cost would stay high are now rebuilding. The teams that built for the cost to drop are scaling.

There is a second-order effect in search that is worth naming. Once retrieval is semantic, the catalog data quality problem becomes the bottleneck. Models cannot return what the catalog cannot describe. Retailers who under-invested in attribute coverage, structured descriptions, and image labeling are now discovering that the limit on their search quality is not the model — it is the catalog. The model has made the data debt visible.
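One way to make that data debt measurable is a plain attribute-coverage audit over the catalog. The sketch below is illustrative only: the required attribute names and the sample products are assumptions, not a real schema, and a production audit would likely weight attributes by category.

```python
from collections import Counter

# Hypothetical list of attributes that retrieval depends on.
REQUIRED_ATTRIBUTES = ("title", "brand", "color", "material")


def attribute_coverage(catalog, required=REQUIRED_ATTRIBUTES):
    """Return the fraction of products with a non-empty value per attribute."""
    if not catalog:
        return {attr: 0.0 for attr in required}
    filled = Counter()
    for product in catalog:
        for attr in required:
            value = product.get(attr)
            # Treat missing keys, None, and blank strings as "not covered".
            if value is not None and str(value).strip():
                filled[attr] += 1
    return {attr: filled[attr] / len(catalog) for attr in required}


# Made-up sample catalog for illustration.
catalog = [
    {"title": "Trail shoe", "brand": "Acme", "color": "blue", "material": "mesh"},
    {"title": "Road shoe", "brand": "Acme", "color": "", "material": None},
    {"title": "Running sock", "brand": "", "color": "black"},
    {"title": "Water bottle", "brand": "Acme"},
]
coverage = attribute_coverage(catalog)
```

Tracking these fractions per category over time gives the cleanup program a number to report against, which is usually what it lacks.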

Pricing is the quietest and most interesting front. Classic elasticity models are not going away, and they should not. They are the right tool for the specific job of estimating how demand responds to price within a stable competitive context. What is changing is everything around them. LLMs are eating the surrounding workflow: parsing competitor pages without a brittle scraper, normalizing third-party feed data, explaining price moves to merchants in plain language, drafting the communication to suppliers when a cost change forces a price change, and flagging anomalies that previously required an analyst staring at a spreadsheet.
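The anomaly-flagging step is the easiest piece of that workflow to picture in code. The sketch below is a deliberately simple stand-in, not a description of any client system: it compares each newly parsed competitor price with the median of recent observations and flags large relative moves for an analyst. The function name, the 25 percent threshold, and the sample data are all assumptions.

```python
from statistics import median


def flag_price_anomalies(history, latest, threshold=0.25):
    """Flag SKUs whose latest price deviates from their recent median.

    history: {sku: [recent prices]}, latest: {sku: newly observed price}.
    Returns {sku: relative change} for moves larger than `threshold`.
    """
    flagged = {}
    for sku, new_price in latest.items():
        past = history.get(sku)
        if not past:
            continue  # no baseline yet, so nothing to compare against
        baseline = median(past)
        if baseline <= 0:
            continue  # guard against bad feed data
        change = (new_price - baseline) / baseline
        if abs(change) > threshold:
            flagged[sku] = round(change, 3)
    return flagged


# Made-up observations for illustration.
history = {"A": [10.0, 10.5, 9.9], "B": [20.0, 20.0, 21.0], "C": []}
latest = {"A": 10.2, "B": 14.0, "C": 5.0, "D": 3.0}
anomalies = flag_price_anomalies(history, latest)
```

In this sample only SKU B is flagged, because its price fell 30 percent against its recent median. The flagged item would then go to a person, or to a model that drafts the plain-language explanation, rather than triggering a price change on its own.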

The model is rarely setting the price directly. It is making the humans who set the price meaningfully faster, and it is letting smaller teams cover larger catalogs. A pricing team that used to manage two thousand SKUs per analyst now manages ten thousand, with better margin discipline and a smaller backlog of unreviewed competitive moves. That is the change that shows up in the P&L, and it does not require a single price to be set autonomously.

The durable moat is not the model. Every retailer can rent the same frontier model. Every retailer can fine-tune on roughly the same public data. The moat is proprietary signal — clean product data, behavioral logs at sufficient scale, returns and post-purchase data tied back to the original session, and the merchandising judgment encoded in years of human decisions on which products to feature, suppress, or promote. None of that is rentable.

Retailers who under-invested in data quality are now paying for it twice. Once to clean it up — a multi-quarter program that is far less glamorous than the AI initiative it is funding. And once in the opportunity cost of every quarter they wait, because the competitors who invested in catalog quality five years ago are now compounding on it. The gap is widening, not closing.

Where retailers are spending real money on what will turn out to be table stakes: bespoke chat interfaces on the storefront, generic recommendation widgets that wrap a model around an existing carousel, internal copilots that put a chat box in front of a search bar, and PDF generators for buyer enablement. These features will all exist in eighteen months as commodity capabilities, either bundled by the commerce platform vendors or available as inexpensive third-party plugins. The spend is not wrong — every retailer needs these features — but it should not be confused with differentiation.

Where the spend is actually compounding: the merchandising operating model itself. The retailers pulling ahead are the ones rebuilding how category managers, planners, and pricing analysts work day-to-day. They are redefining roles, retraining teams, restructuring the weekly cadence of merchandising decisions, and rewiring the data flows that connect supplier onboarding to assortment planning to search ranking to pricing. None of that work shows up in a product launch. All of it shows up, twelve months later, in conversion, margin, and inventory turnover.

There is a related pattern in vendor selection. The retailers who are getting value out of AI are not the ones with the most ambitious vendor strategy. They are the ones who treat the model as a substitutable component and invest in the surrounding data infrastructure and the workflow integration. They can swap one frontier model for another in a quarter, because the value sits in the pipeline around the model, not in the model itself.
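In code, treating the model as a substitutable component usually means a narrow interface plus validation at the boundary. The sketch below assumes a hypothetical attribute-extraction step; the class names, the allowed keys, and the canned "model" response are invented for illustration and do not correspond to any vendor's API.

```python
from typing import Protocol

# Hypothetical schema: the only attribute keys downstream systems accept.
ALLOWED_KEYS = {"color", "material"}


class AttributeExtractor(Protocol):
    """Anything that turns supplier text into attributes can plug in here."""

    def extract(self, supplier_text: str) -> dict[str, str]:
        ...


class KeywordExtractor:
    """Trivial rule-based implementation, useful as a baseline."""

    COLORS = ("red", "blue", "black")

    def extract(self, supplier_text: str) -> dict[str, str]:
        words = supplier_text.lower().split()
        for color in self.COLORS:
            if color in words:
                return {"color": color}
        return {}


class CannedModelExtractor:
    """Stand-in for a hosted model; returns a fixed, messier response."""

    def extract(self, supplier_text: str) -> dict[str, str]:
        return {"color": " Blue ", "material": "Mesh", "mood": "sporty"}


def enrich(product: dict, extractor: AttributeExtractor) -> dict:
    """Normalize extractor output so downstream code never sees vendor quirks."""
    raw = extractor.extract(product["description"])
    clean = {
        key: value.strip().lower()
        for key, value in raw.items()
        if key in ALLOWED_KEYS and value.strip()
    }
    return {**product, **clean}


product = {"sku": "SKU-1", "description": "Lightweight blue trail shoe"}
from_keywords = enrich(product, KeywordExtractor())
from_model = enrich(product, CannedModelExtractor())
```

Because `enrich` owns normalization and the schema check, replacing one extractor with another changes a single constructor call, and the rest of the pipeline is unaffected.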

The retailers who are getting the least value are the ones who picked a single AI partner, signed a multi-year commitment, and outsourced the strategy. Those teams are now discovering that the partner's roadmap is not the same as their roadmap, and that the moat they thought they were building belongs to the partner.

Eighteen months from now, the storefront will look only modestly different. A little faster, a little more personalized, a little better at search. The merchandising organization behind it will be unrecognizable. Smaller teams, different roles, different cadences, different metrics. The retailers who understand that the shift is an operating-model shift, not a feature shift, will be the ones quietly taking share from everyone who waited for the visible UI change to arrive.

The window to act is shorter than it looks. The model commoditizes quickly. The data and the operating model do not.

A practical note on team composition, because the question comes up in every engagement. The retailers who are moving fastest on this are not the ones with the largest data-science organizations. They are the ones who paired a small applied team of three to six engineers with strong shipping discipline with an empowered merchandising counterpart who could make weekly decisions about what to ship next. The larger teams we have seen tend to over-invest in research and under-invest in the integration work, and the integration work is where the value lives. The frontier model is not the bottleneck. The pipeline around it is.

There is a related point about evaluation. The retailers who are getting honest results have built a small set of offline evaluation harnesses — for assortment quality, for search relevance, for pricing decisions — that they trust enough to make weekly model swaps without a months-long online test. The harnesses are not perfect, and the teams know it. They run online tests for the decisions that warrant them, and they use offline evaluation for the much larger volume of decisions that do not. The teams without these harnesses end up either over-testing — which slows them to the speed of the slowest experiment — or under-testing, which corrupts their judgment about what is actually working.
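For search relevance, a harness of this kind can be quite small. The sketch below scores two rankers on a labeled query set using NDCG with linear gain (one common variant; others use exponential gain) and approves a swap only if the candidate does not regress. The function names, the gate rule, and the tiny label set are assumptions for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_ids, judgments, k=10):
    """NDCG@k for one query; unjudged items count as relevance 0."""
    gains = [judgments.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(judgments.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def mean_ndcg(ranker, labeled_queries, k=10):
    """Average NDCG@k of `ranker` (query -> ranked ids) over labeled queries."""
    scores = [
        ndcg_at_k(ranker(query), judgments, k)
        for query, judgments in labeled_queries.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


def approve_swap(current, candidate, labeled_queries, min_gain=0.0, k=10):
    """Approve the candidate only if it beats the current ranker by min_gain."""
    delta = mean_ndcg(candidate, labeled_queries, k) - mean_ndcg(
        current, labeled_queries, k
    )
    return delta >= min_gain


# Made-up judgments: higher numbers mean more relevant.
labeled = {"running shoes": {"a": 3, "b": 2, "c": 0}}


def current_ranker(query):
    return ["c", "b", "a"]


def candidate_ranker(query):
    return ["a", "b", "c"]


swap_ok = approve_swap(current_ranker, candidate_ranker, labeled)
```

The gate is only as trustworthy as the labels behind it, which is why the teams described above keep online tests for the decisions that warrant them.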

One quiet structural change deserves attention. The role of the merchandising analyst is starting to bifurcate. Half of the work — the rule authoring, the manual exception handling, the spreadsheet reconciliation — is being absorbed by the model. The other half — the strategic judgment, the supplier negotiation, the assortment storytelling — is becoming more valuable, not less, because the analyst has more time to spend on it. Retailers who treat the change as a headcount-reduction exercise lose the second half along with the first. Retailers who treat it as a role-redesign exercise gain capacity they did not have before. The same headcount delivers a meaningfully larger surface of work.

The compliance and explainability layer is the underdiscussed risk. Once a learned system is making merchandising or pricing decisions at scale, the regulator's question — and the merchant's question, and the supplier's question — is the same: why this product, why this price, why this ranking. The retailers who built explanation into the system from the start have a defensible answer. The retailers who bolted it on after the fact are now spending real engineering time reverse-engineering decisions their own system made. The cost of explainability is much lower if it is part of the design than if it is added later.
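Building explanation in from the start can be as modest as writing a structured record at the moment each decision is made. The sketch below is one possible shape, not a compliance standard: the field names, the model version string, and the sample values are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DecisionRecord:
    """What was decided, by which model version, from which inputs, and why."""

    sku: str
    decision: str        # e.g. "price_change" or "rank_boost"
    value: float
    model_version: str
    inputs: dict         # the signals the system actually used
    reason: str          # plain-language explanation captured at decision time
    decided_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record_decision(log, **fields):
    """Append a serialized record to an append-only log and return it."""
    record = DecisionRecord(**fields)
    log.append(json.dumps(asdict(record), sort_keys=True))
    return record


# Illustrative usage with made-up values.
audit_log = []
record_decision(
    audit_log,
    sku="SKU-1",
    decision="price_change",
    value=19.99,
    model_version="pricing-model-v1",
    inputs={"competitor_price": 18.49, "unit_cost": 11.20},
    reason="Competitor dropped price; margin floor still met.",
)
```

Answering "why this price" then becomes a lookup against the log rather than a reconstruction of what the system might have been thinking.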
