7 min read · Hass Dhia

Why Walmart's OpenAI Commerce Pilot Failed While Macy's Agentic Tool Delivered 4.75x Revenue Per Order

agentic-ai · retail-technology · data-infrastructure · behavioral-economics · brand-strategy

Six months after Walmart and OpenAI launched their "Instant Checkout" pilot - where consumers could buy Walmart products directly within ChatGPT - Walmart shut it down. Conversion rates were three times lower than when shoppers clicked through to Walmart's own properties. Meanwhile, Macy's quietly reported that orders placed through its proprietary AI shopping assistant, "Ask Macy's," generated 4.75 times more revenue per order than those that did not use the tool at all.

Both projects carried the "agentic commerce" label. One failed. One worked. The approximately 14x gap in outcomes between them - 4.75x higher revenue per order on one side, roughly 3x lower conversion on the other, so roughly 4.75 × 3 ≈ 14 - is not a product story. It is an infrastructure story - and the companies ignoring it are repeating the strategic mistake behind every technology adoption failure cycle.

The Outcome Gap Nobody Is Talking About

Adweek's coverage of these two pilots lands with uncomfortable clarity: "No one - platforms, retailers, third-party tech companies, or marketers - have figured it out." The editors are not being hyperbolic. They are describing a structural condition.

Macy's Chief Customer and Digital Officer Max Magni attributed the 4.75x revenue result to "Ask Macy's," a system built on Google Gemini running against Macy's own customer data, purchase history, and preference graph. The AI is powerful. But the power comes from Macy's data, not from the model. Every recommendation the agent makes is grounded in what Macy's actually knows about the person asking.

Walmart's Instant Checkout pilot inverted this architecture. Walmart's product catalog, pricing, and availability were being queried through OpenAI's interface - a layer Walmart did not control, could not optimize, and could not govern. The result was a leaky transaction layer with no stable common definitions between Walmart's operational data and the model's representation of it.

What makes this more than a cautionary tale is the fragmentation sitting underneath it. Google is building a Universal Commerce Protocol for agent transactions. PayPal and Mastercard are each building separate payment protocols. OpenAI has its own layer. Every major platform is constructing the standard it wants to win, which means no standard is winning. The trust gap that emerges when AI agents operate without authoritative data is not going away - it is widening as the protocol landscape fragments further.

What McKinsey's Seven-Principle Framework Actually Says

McKinsey's analysis of enterprise agentic AI adoption arrives at a finding that explains the Walmart result precisely: 80% of companies cite data limitations as the primary roadblock to scaling AI agents. Not model quality. Not compute cost. Not organizational buy-in. Data.

The framework McKinsey identifies requires seven principles for data architecture:

  1. Treat data ingestion as a product - make data enter once and be usable everywhere
  2. Share meaning, not just data - common definitions so agents, models, and analytics interpret information identically
  3. Use one data foundation for both analytics and AI - build it once, use it across all applications
  4. Build trust and governance into the platform by default, not as a retrofit
  5. Expose capabilities through stable interfaces that teams can rely on without rework
  6. Make behavior visible and measurable - observability built in from the start
  7. Provide controlled, governed execution environments for AI agents
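The first three principles can be made concrete in a few lines. The sketch below is a hypothetical, minimal illustration - the schema, field names, and functions are invented for this post, not McKinsey's or any retailer's actual design. One validated ingestion point produces records that both an analytics function and an agent-facing function consume, so there is exactly one shared definition of what a product record means:

```python
from dataclasses import dataclass

# One shared schema ("share meaning, not just data") consumed by both
# analytics and the agent. All names here are illustrative assumptions.

@dataclass(frozen=True)
class ProductRecord:
    sku: str
    price_usd: float
    in_stock: bool

def ingest(raw: dict) -> ProductRecord:
    """Single validated entry point: data enters once, usable everywhere."""
    if raw["price_usd"] < 0:
        raise ValueError(f"negative price for {raw['sku']}")
    return ProductRecord(sku=raw["sku"],
                         price_usd=float(raw["price_usd"]),
                         in_stock=bool(raw["in_stock"]))

def analytics_avg_price(records: list[ProductRecord]) -> float:
    """Analytics reads the same foundation the agent does (principle three)."""
    return sum(r.price_usd for r in records) / len(records)

def agent_recommend(records: list[ProductRecord]) -> list[str]:
    """The agent queries the identical records - no second ingestion layer."""
    return [r.sku for r in records if r.in_stock]
```

The failure mode in the Walmart pilot, on this reading, is the opposite shape: a second ingestion path into the model's own representation, with no equivalent of the shared `ProductRecord` contract between the two.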

Macy's architecture reflects this framework closely. The AI runs against Macy's own data graph, with Macy's own governance. Walmart's pilot violated principle one immediately: a second ingestion layer with no shared meaning between Walmart's catalog and the model's representations of it.

McKinsey's finding that fewer than 10% of enterprises that experiment with agents actually scale them to tangible value is not a statement about AI capability. It is a statement about data readiness. That minority built the foundation first. This is the kind of infrastructure-level pattern that looks obvious in retrospect and gets ignored in the moment because narrative moves faster than architecture. This is also the pattern STI's research tracks systematically - where the bottleneck to AI-driven outcomes is almost never the model.

The Compounding Logic

Once an agent has reliable access to high-quality, well-governed data, its performance improves over time. Every transaction enriches the training signal. Every enrichment makes the next recommendation sharper. The Macy's system gets better every month it runs.

A cross-platform deployment does not improve along this curve without structural change to the underlying architecture. It accumulates noise instead of signal. The gap does not stay at 14x - it compounds in the direction of whoever built the foundation first.
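The compounding claim can be made concrete with a toy model. Every number below is invented purely for illustration: assume the owned-data agent gains a small fixed percentage of recommendation quality per feedback cycle, while the borrowed-platform deployment loses a small percentage to accumulated noise.

```python
# Toy model of the compounding argument. The rates are assumptions
# for illustration, not measured values from either pilot.

def compound_quality(start: float, rate_per_cycle: float, cycles: int) -> float:
    """Geometric growth (or decay) of recommendation quality per feedback cycle."""
    quality = start
    for _ in range(cycles):
        quality *= 1 + rate_per_cycle
    return quality

# Owned data: every transaction enriches the signal (+5% per month, assumed).
owned = compound_quality(1.0, 0.05, 24)
# Borrowed platform: noise accumulates instead (-2% per month, assumed).
borrowed = compound_quality(1.0, -0.02, 24)

print(f"quality ratio after 24 months: {owned / borrowed:.1f}x")
# → quality ratio after 24 months: 5.2x
```

The point is not the specific rates but the direction: any persistent per-cycle advantage in signal quality widens geometrically, which is why a gap like this is a starting point rather than a ceiling.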

McKinsey's research shows that the top 6% of AI performers are 3.6 times more likely to pursue enterprise-level AI transformation rather than incremental deployment. They are 55% more likely to fundamentally redesign workflows when deploying AI. These organizations are not doing more of the same thing faster. They are building different organizational capabilities that make the next transformation easier than the last.

Agentic Commerce as Ozempic for Strategy

Nick Maggiulli's Signal Collapse uses a deceptively simple observation to make a structural point. For decades, physical fitness was the ultimate proof of work - you could not fake it, you had to earn it through sustained effort. Then GLP-1 agonists arrived. Ozempic and Wegovy let anyone acquire the outcome without the underlying work. Overnight, the signal devalued.

The same dynamic is playing out with "agentic commerce" as a strategic claim.

Calling your roadmap "agentic" has become a way to acquire the signal of being at the frontier without doing the infrastructure work the frontier actually requires. The label is cheap. McKinsey's seven-principle data architecture is not. And because the label is cheap, it gets claimed broadly - which accelerates the signal collapse Maggiulli describes.

Maggiulli identifies four strategies for demonstrating genuine value when old signals erode: leverage your history, do deep work, command attention, and embrace the machine. Cal Newport's framing, cited in the piece, lands clearly: in the new economy, groups with a particular advantage are "those who can work well and creatively with intelligent machines, those who are the best at what they do, and those with access to capital."

The enterprise equivalent of "those who are the best at what they do" is organizations with the cleanest, most domain-specific data foundations - built before the agentic narrative caught fire. That is the new proof of work. We have written previously about how the brand proof era shifts value away from narrative claims toward demonstrated outcomes. Agentic commerce is the same dynamic with a different label on it.

The companies with proven data infrastructure built before the hype cycle have the history that cannot be faked. The organizations doing the deep architecture work now, when it is still operationally hard, are building moats that signal-acquirers will not be able to replicate when the market consolidates around the handful of implementations that actually work.

Why Intelligent Organizations Keep Skipping the Infrastructure Work

If the data is this clear - a proprietary agent on a strong data foundation delivering 4.75x revenue per order while a borrowed platform converted three times worse on comparable use cases - why are so many organizations still buying the cross-platform narrative?

Behavioral economist Guy Hochman's 2025 framework, which he calls Homobiasos, offers the most coherent answer. His argument is that humans are not irrational by accident. Our cognitive priority is coherence preservation over accuracy. When psychological coherence and factual accuracy conflict, coherence wins - not because of ignorance, but because of evolutionary design. We construct narratives that maintain our moral self-image and social standing, then defend those narratives against contradictory evidence.

Applied to enterprise AI strategy, this explains a specific pattern. Organizations adopt the "agentic commerce" framing not because the evidence supports it, but because it preserves the self-image of being strategically advanced. When the Walmart pilot produced poor conversion rates, the coherence-preserving interpretation was "we learned something valuable and are iterating" rather than "we skipped the infrastructure work and got the predictable result."

Hochman's concept of "motivated moral reasoning" - constructing narratives that maintain self-image rather than reflect data - is precisely what drives the gap between the companies that acknowledge data limitations as a blocker (the 80% McKinsey identified) and the ones actually doing something structurally different about it.

The $79 Billion Brand Architecture Deflection

The Paramount-Warner Bros. merger is an extreme version of this dynamic. With $79 billion in debt on the table, the strategic conversation is largely framed as a brand architecture question: which portfolio brands survive, which get collapsed, which get elevated. Brand architecture is a real discipline. But it is also the coherence-preserving frame that sidesteps the harder operational question - does the merged entity's data infrastructure actually support the personalization, recommendation, and distribution systems that define competitive advantage in streaming?

Brand names are the cheapest part of the stack to rename. The data foundations are not.

The agentic advertising reality check this industry needs is not about which companies have the best AI narrative. It is about which ones have the data architecture that makes the narrative accurate.

The Compounding Advantage of Getting This Right Early

The Macy's result is not just a win. It is the early data point in what becomes a structural advantage. Every transaction through "Ask Macy's" enriches the preference data Macy's already owns. The system is self-reinforcing in a way the borrowed-platform model is structurally not - not because of model quality, but because of data ownership.

The industry parallel Adweek draws is programmatic advertising, which also required years of standardization before scaling. That is accurate, but it undersells the compounding problem. In programmatic, the companies that built the data infrastructure early - the DMPs, the audience graphs, the attribution systems - did not just participate in the market when it matured. They became the market. The retailers and brands that waited for standards to emerge found themselves licensing data back from the infrastructure players they had declined to become.

The same outcome is structurally available in agentic commerce. The organizations that build McKinsey's seven-principle data foundation now will not just perform better in the short term. They will own the preference data, the behavioral graph, and the governance systems that any emerging standards ultimately need to run against. The borrowed-platform players will need to buy access to what the infrastructure builders already own.

The infrastructure is the moat. The narrative is the distraction. The gap between the organizations that understand this today and the ones still optimizing the narrative is exactly as wide as the 14x outcome difference between Ask Macy's and the Walmart/OpenAI pilot - and it is compounding in one direction.

If you are evaluating where your organization sits on this infrastructure-to-narrative spectrum, STI's analysis tools can help surface what your current data architecture actually supports - and where the gap between the roadmap claim and the operational reality is widest.
