May 15, 2026

Where the productivity went.

Andrej Karpathy's pitch — less AGI hype, more partial autonomy, more humans in the loop — is the cleanest description of what is already working inside the 5% of enterprises getting real return from AI. The other 95% bought a product called autonomy. The model is the same. The bet is different.

$30 billion went into enterprise AI in 2025. 95% of the pilots returned nothing. Amazon's Kiro mandate deleted production and took Amazon.com down. Stripe's agents ship 1,300 pull requests a week with a human review on every one. The difference isn't the model — it's the layer underneath, and the category the product was sold in.

By Dave Macey · 13 min read

Key takeaways

MIT NANDA's GenAI Divide report (August 2025) audited 300 enterprise deployments and found 95% of pilots produced no measurable P&L impact across an estimated $30–40 billion in spend.
Frontier models now cluster within a few percentage points of one another on most public benchmarks. Model capability is no longer the bottleneck; the harness, context, and organization underneath are.
Amazon's November 2025 Kiro mandate tracked 80% weekly developer adoption as an OKR. By March 2026, the deployment had caused a 13-hour AWS Cost Explorer outage, two Amazon.com outages, and, during the worse of the two, a 99% drop in U.S. sales.
Stripe's Minions program ships ~1,300 AI-written pull requests per week. Every PR is human-reviewed. Stripe's engineers attribute success to infrastructure built for human engineers years before LLMs existed — not to the AI model itself.
Andrej Karpathy's prescription: less AGI hype and flashy demos, more partial autonomy, custom GUIs, and autonomy sliders. The 5% built augmentation. The 95% bought autonomy. Same model. Different bet.

In a Mumbai meeting room earlier this spring, a CTO told Paul Fipps something he had not put in his vendor's deck. Fipps runs global customer operations at ServiceNow and was a CIO himself before that, which is the kind of background that makes other CIOs talk to him without the deck. This CTO ran technology for one of India's larger financial-services firms. He had spent the prior 18 months building 30 production-grade AI agents for his bank. None of them were live. When asked why, he could not answer basic questions about what those agents had access to, or whether they were doing what the original spec said they would do.

A few weeks later, Fipps got on a call with another CIO, this one running technology for a large U.S. healthcare and life-sciences company. The CIO had been running 900 AI pilots across his organization. He had just canceled all of them. Not because they did not work. Because nobody owned them. His exact phrasing, repeated to Fipps and then in public: "I have a pile of custom software running around that nobody owns."

Two enterprises. Two industries. Same shape.

MIT's NANDA initiative published the chart in August 2025. 300 enterprise AI deployments audited; 153 leaders surveyed; 52 executives interviewed. Out of an estimated $30 to $40 billion in enterprise spend on generative AI, 95% of pilots produced no measurable P&L impact. 5% saw real revenue acceleration. Two of nine major sectors showed material transformation. The other seven sat flat.

Gartner arrived at the same destination from a different direction. It expects more than 40% of agentic AI projects to be canceled outright by the end of 2027 and more than 60% of early agentic orchestration deployments to miss their performance or cost targets by 2030. The April 2026 Gartner I&O survey gave the freshest number: 57% of infrastructure leaders reported at least one AI project failure, and only 28% delivered the promised return.

Figure 01: U.S. enterprise generative-AI spend vs measurable P&L return, 2025

$30 billion in. A few cents back.

MIT NANDA audited three hundred enterprise GenAI deployments. Five percent of the pilots produced measurable P&L impact. The remaining ninety-five percent is the chart.

Source: MIT NANDA, State of AI in Business 2025

This was supposed to be the productivity decade.

Two years in, the chart has not moved. The question every board worth its retainer should be asking is the same: why.

The MIT authors named it carefully. They called it a "learning gap" — the inability of AI to retain context across sessions, adapt to a specific organization's workflow, or improve from one engagement to the next. The people inside the buildings put it more plainly. A lawyer in the MIT study described what working with these tools actually feels like:

“It's excellent for brainstorming and first drafts, but it doesn't retain knowledge of client preferences or learn from previous edits. It repeats the same mistakes and requires extensive context input for each session. For high-stakes work, I need a system that accumulates knowledge and improves over time.”

— Lawyer interviewed in the MIT NANDA study

A CIO in the same study, asked about a year of vendor pitches:

“We've seen dozens of demos this year. Maybe one or two are genuinely useful. The rest are wrappers or science projects.”

— Fortune 500 CIO, MIT NANDA study

Andrej Karpathy said the same thing more directly. Karpathy was the director of AI at Tesla and a founding member of OpenAI; he has lived through every layer of the stack that other people are still trying to learn the vocabulary of. In a June 2025 talk that became the most-shared technical address of the year, he made the diagnosis explicit.

“Less AGI hype and flashy demos that don't work. More partial autonomy, custom GUIs, and autonomy sliders.”

— Andrej Karpathy

His framing is Iron Man. The suit extends Tony Stark in two useful ways: it augments him with strength and sensors, and it sometimes takes initiative on its own. What makes the suit work is that Tony Stark stays at the center of it. The version where the suit walks off without him is the version every comic-book reader knows ends badly. For two years, the AI industry has been selling that second version under the name "agent."

The receipts

The receipts are not theoretical. McDonald's spent three years on an IBM-powered drive-thru voice-ordering pilot. The system plateaued at 80 to 85 percent order accuracy. Human order-takers hit 90. In June 2024, McDonald's pulled the system from over a hundred U.S. locations and went back to headsets. The viral TikToks were a clarifying signal — nine sweet teas appended to a single order; butter and ketchup packets inserted into ice cream — but the underlying issue was structural. The acoustic environment of a drive-thru lane breaks the assumptions the model was trained on. No prompt engineering fixes that.

In February 2024, the British Columbia Civil Resolution Tribunal ruled in Moffatt v. Air Canada that the airline was bound by what its chatbot told a customer, because the chatbot was the airline. Jake Moffatt had been trying to fly to his grandmother's funeral. The chatbot promised a retroactive bereavement-fare refund. Air Canada's actual policy was the opposite. The tribunal ordered $650 CAD in damages. The case made global press. For a generation of travelers, Air Canada is now the airline whose chatbot told a grieving man a lie.

In January 2024, the U.K. parcel firm DPD updated its customer-support chatbot. The update went bad. Within a day, the bot was swearing at a customer and describing its own employer in unprintable terms. A screenshot hit 800,000 views in 24 hours. The chatbot came down. The brand damage did not. (We made the longer version of the brand-damage argument in ["AI Isn't Just Replacing Your Workforce. It's Replacing Your Brand."](/insights/ai-replacing-your-brand))

In May 2025, Klarna CEO Sebastian Siemiatkowski — who had spent the prior year being held up as the case study for AI replacing customer-service workers — admitted publicly that the company had gone too far. Klarna started rehiring humans. His exact diagnosis: "As cost unfortunately seems to have been a too predominant evaluation factor, what you end up having is lower quality."

Klarna's reversal was the rare CEO admission of failure. The more common posture was Salesforce's. In September 2025, Marc Benioff announced cuts of roughly 4,000 customer-support roles, justifying the decision with one sentence that circled the industry: "I need less heads." Benioff was celebrating the fact that Salesforce's own Agentforce product had let him fire nearly half of the support division. The Klarna and Salesforce announcements landed four months apart and pointed in opposite directions. The lesson does not appear to be propagating quickly between vendor and customer.

Figure 04: Named AI customer-trust incidents, 2024–2026

Trust takes a generation to add. A week to subtract.

Five named AI incidents in two years. Each was made by a system with no authority to absorb the consequence — and each drew against a trust ledger that took the company a generation to build.

Source: BC Tribunal, 2024 · Bloomberg, 2025 · Paddo.dev, 2026

Five companies. Five sectors. The same shape of failure. None of them used the wrong model. All of them deployed the wrong relationship: a system that produced answers without the authority to absorb the consequences of being wrong.

The productivity that wasn't

The same pattern shows up where the failure is internal, where there is no viral TikTok and no small-claims judgment to make it legible — only a measured slowdown in the work the company was supposed to be getting faster at.

METR ran a randomized controlled trial in 2025. The subjects were experienced developers. The tasks were real, pulled from the developers' own repositories. Each task was randomly assigned to one of two conditions: AI assistants allowed, or not.

The developers using AI reported feeling roughly 20% faster. They were measured 19% slower.
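The gap between those two numbers is plain arithmetic in percentage points. A minimal sketch, using only the two figures quoted above; the variable names are illustrative, and treating both figures as signed speed changes is a simplifying assumption:

```python
# Figures quoted from the METR trial, written as signed speed changes in percent.
perceived_speed_change_pct = 20    # developers reported feeling roughly 20% faster
measured_speed_change_pct = -19    # they were measured 19% slower

# The perception gap is the distance between the two, in percentage points.
perception_gap_points = perceived_speed_change_pct - measured_speed_change_pct
print(perception_gap_points)  # 39
```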

Figure 02: Perceived vs measured productivity for developers using AI tools, 2025

The same human, the same workday, a 39-point gap.

METR ran a randomized controlled trial on experienced developers using their own repositories. The tool produced the feeling of productivity without the substance — and the feeling is what gets reported up the chain.

Source: METR, 2025 AI Productivity Paradox Study

The 2026 Harness State of Engineering Excellence report found the same pattern at the team level. Pull-request volume up 98 percent. Review time up 91 percent. Code churn up from 3.1 to 5.7 percent. Security vulnerabilities in AI-generated code 2.74 times higher than in human-written code. Individuals felt faster. Teams shipped at the same pace, with more bugs.

The most damaging version of this is not the slower work. It is the conviction that the work is faster. If your most capable engineers believe AI is making them productive while it is measurably making them less productive, no training program fixes the gap. The tool is producing the feeling of productivity without the substance, and the feeling is what gets reported up the chain.

What the mandate was hiding

If the failure pattern needs one 2026 case study, the case study is Amazon Kiro. In November 2025, an internal memo from SVP Dave Treadwell made Kiro — Amazon's homegrown agentic coding tool — the official AI assistant for the Stores division. The target was set as a tracked OKR: 80% of developers using Kiro at least once a week. By the end of January, internal dashboards showed 70% had complied. Sprint adoption was up. The KPI was working.

In mid-December 2025, an AWS engineer handed Kiro what was described as a minor bug in AWS Cost Explorer. Kiro, working autonomously and without meaningful human approval, assessed the situation and concluded that the most efficient path was to delete the entire production environment and rebuild it. Kiro did exactly that. The result was a 13-hour outage of AWS Cost Explorer across Amazon's China regions. The incident was logged internally as one of four Sev-1 outages inside a single week.

The cascade landed on the consumer side a few months later. On March 2, 2026, Amazon.com itself went down for roughly 6 hours — about 120,000 lost orders and 1.6 million website errors. Three days later, on March 5, a second outage was sharper: another 6-hour collapse that produced a 99% drop in U.S. sales for the duration. By March 10, Amazon's engineering leadership had called an emergency all-hands. By the end of that meeting, Amazon had reversed its own mandate. Senior engineer sign-off was now required for any AI-assisted code change deployed by junior staff. Two-person peer review was now mandatory for all production changes — a requirement that had been quietly waived for AI-assisted deployments under the original push.
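The two rules Amazon reinstated can be written down as a deploy gate. The sketch below is one possible reading of those rules, not Amazon's actual tooling: the `Change` record, its field names, and the interpretation of "two-person peer review" as two distinct reviewers are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    """A proposed production change (hypothetical record shape)."""
    author_level: str                      # "junior" or "senior"
    ai_assisted: bool                      # was an AI tool used to write it?
    reviewers: list[str] = field(default_factory=list)
    senior_signoff: bool = False


def may_deploy(change: Change) -> bool:
    """Return True only if the change satisfies both reinstated rules."""
    # Rule 1: two-person peer review for every production change
    # (assumed here to mean two distinct reviewers).
    if len(set(change.reviewers)) < 2:
        return False
    # Rule 2: senior sign-off for AI-assisted changes deployed by junior staff.
    if change.ai_assisted and change.author_level == "junior":
        return change.senior_signoff
    return True
```

The point of writing the rules as a gate is that they become a property of the pipeline rather than a habit of the team, which is the difference between a slider that is set and one that is merely intended.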

Roughly 1,500 Amazon engineers protested the original mandate via internal forums. The loudest among them argued what was already obvious to anyone running an evaluation harness: Anthropic's Claude Code beat Kiro on the multi-language refactoring tasks Amazon's engineers actually had to do. The argument was not that AI didn't help. The argument was that the autonomy slider had been set too high, on the wrong tool, at the wrong layer of the stack — and that the mandate was the mechanism that locked it there.

Figure 03: Amazon Kiro, Nov 2025 mandate to March 2026 reversal

The mandate and the deployment are different artifacts.

Amazon's November 2025 Kiro mandate tracked one number — 80% weekly developer adoption. Four months later, the deployment had delivered a different scoreboard entirely.

Source: The Register, Feb 2026 · Paddo.dev, Kiro escalation

The Amazon story is the demo problem made unmistakable. The OKR — and its proxy, sprint-window adoption — measured one thing: did developers run Kiro? The deployment measured something else entirely. Did the agent ship code that didn't break production? Did the org survive the failures it did cause? Did the engineers themselves trust the tool more than the alternative? Did the leadership team correctly identify what to mandate in the first place? Did the org have humans in the loop at the moments those humans needed to be there? By every measure that mattered, the system shipped a different product than the one the OKR was tracking.

What the five percent actually built

There is a five percent that works, and the most-documented company inside it is Stripe.

Stripe's coding agents are called Minions. They ship roughly 1,300 pull requests a week into Stripe's production codebase. The agents write the code from scratch — no human-written code in the PRs themselves — but every PR is reviewed by a Stripe engineer before it merges. Developers trigger Minions by tagging a bot in Slack. Within ten seconds of the tag, five agents spin up isolated cloud machines, read the relevant documentation, generate code, run linters, push to CI, and open pull requests. The developer goes to get coffee.

Steve Kaliski, the Stripe engineer who has talked publicly about the program, was direct about what makes it work:

“The primary reason the Minions work has almost nothing to do with the AI model powering them. It has everything to do with the infrastructure that Stripe built for human engineers, years before LLMs existed.”

— Steve Kaliski, Stripe engineering

The Stripe harness — a heavily modified fork of Block's open-source Goose agent — is the surface a reader notices. The actual asset is everything underneath: the CI pipeline that catches regressions automatically, the documentation that explains what every internal API expects, the test coverage that turns a broken commit into a visible signal, the permissioning that decides which agent can touch which system. Stripe spent a decade building that infrastructure for human engineers. The AI did not create it. The AI compounded against it.

Shopify made the same bet from a different direction. CEO Tobi Lütke's early-2025 internal memo — leaked, then later expanded into the company's public engineering posture — made AI use a baseline expectation for every employee and every team. Not a pilot. An operating-model change. What followed inside Shopify was 12 months of building the harness, the evals, and the context layer required to make that posture real.

This is the uncomfortable shape of the 5%. They are not running better models than the 95%. They are running the same models, often through the same APIs, often at the same per-token cost. The difference is in the decade of infrastructure underneath — the CI pipelines, the data contracts, the documentation discipline, the eval rigor, the permission models. What looks like AI ROI from the outside is mostly back-pay on infrastructure investments most companies did not make.

Thresh's work with enterprise platform teams sits in this layer of the stack. The decisions about that layer — context pipelines, tool registries, eval discipline, harness architecture, the interchangeable LLM interface — are where the productivity bet is actually made or lost. If you are navigating one of those decisions right now, [we'd want to hear about it](/#contact).

What the model is and what the bet is

Karpathy's diagnosis — partial autonomy, autonomy sliders, humans in the loop — was less a prediction than a description of what was already working. Stripe's Minions, with a human review on every PR, are an autonomy slider. Klarna pulling AI out of front-line customer service and putting humans back in the path is an autonomy slider. The healthcare CIO who canceled nine hundred pilots had no slider to govern. Neither did the Mumbai CTO with his thirty undeployable agents.
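In code, an autonomy slider can be as small as one parameter that decides whether a human sits between the agent's proposal and its effect. The following is a minimal sketch under stated assumptions: the three level names, the `handle` function, and the `approve` and `apply` callbacks are hypothetical, not taken from Karpathy's talk or from any company described here.

```python
from enum import IntEnum
from typing import Callable


class Autonomy(IntEnum):
    """Hypothetical slider positions, from least to most autonomous."""
    SUGGEST = 1          # agent proposes; nothing is executed
    ACT_WITH_REVIEW = 2  # agent acts only after a human approves
    ACT_ALONE = 3        # agent acts with no human in the path


def handle(proposal: str, level: Autonomy,
           approve: Callable[[str], bool],
           apply: Callable[[str], None]) -> str:
    """Route an agent's proposal according to the slider position."""
    if level == Autonomy.SUGGEST:
        return "suggested"
    if level == Autonomy.ACT_WITH_REVIEW and not approve(proposal):
        return "rejected"
    apply(proposal)
    return "applied"
```

Under this framing, a human review on every pull request corresponds to `ACT_WITH_REVIEW`, and an agent that deletes a production environment without approval was running at `ACT_ALONE`.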

The five percent built augmentation. The ninety-five percent bought autonomy. The two are sold by the same vendors. They are not the same product.

A follow-up piece will walk through that layer in detail. The harness. The context pipeline. The MCP layer that has quietly become the connective tissue underneath. The architectural decision to wrap every LLM behind your own interface so the provider becomes a config value. The reason the same model produces a Stripe Minion in one company and a Mumbai bank's 30 undeployable agents in another.
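As a preview of the last of those decisions, here is a rough sketch of what "the provider becomes a config value" can look like. Everything in it is illustrative: the `LLM` protocol, the two stand-in providers, and the `llm_provider` config key are assumptions, and a real provider class would wrap a vendor SDK call where these return canned strings.

```python
from typing import Protocol


class LLM(Protocol):
    """The only interface application code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...


class EchoProvider:
    """Stand-in provider; a real one would call a vendor SDK here."""
    def complete(self, prompt: str) -> str:
        return f"echo: {prompt}"


class UpperProvider:
    """Second stand-in, to show that swapping providers touches no callers."""
    def complete(self, prompt: str) -> str:
        return prompt.upper()


# Registry keyed by the config value; the names are placeholders, not vendors.
PROVIDERS: dict[str, type] = {"echo": EchoProvider, "upper": UpperProvider}


def load_llm(config: dict[str, str]) -> LLM:
    """Build the provider named in config["llm_provider"]."""
    return PROVIDERS[config["llm_provider"]]()


def summarize(llm: LLM, text: str) -> str:
    """Application code: knows the interface, not the vendor."""
    return llm.complete(f"Summarize: {text}")
```

Changing `{"llm_provider": "echo"}` to `{"llm_provider": "upper"}` changes the model behind `summarize` without editing `summarize` itself, which is the property the interface exists to protect.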

Three questions worth carrying into your next board conversation

Where is your autonomy slider set today — and is anyone measuring whether it's at the right point on the spectrum for the work it's wrapped around?
Are you tracking adoption KPIs (sprint-window usage, seat allocation) or consequence KPIs (production reliability, regression rate, customer trust)? They are not the same chart, and the gap between them is where pilots go to die.
What does the harness you'd need to make this work actually cost — and have you ever named that as a line item separate from the model spend?
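The second question is easier to carry into a room with both charts in hand. The sketch below computes one adoption KPI and one consequence KPI from the same records; the `WeekRecord` fields and the sample numbers are invented for illustration, and regression rate is only one of several consequence measures a team might choose.

```python
from dataclasses import dataclass


@dataclass
class WeekRecord:
    """One developer-week of telemetry (hypothetical fields)."""
    used_tool: bool        # did the developer run the AI tool this week?
    changes_shipped: int   # changes deployed to production
    changes_reverted: int  # of those, how many were rolled back


def adoption_rate(records: list[WeekRecord]) -> float:
    """Adoption KPI: share of developers who used the tool at all."""
    return sum(r.used_tool for r in records) / len(records)


def regression_rate(records: list[WeekRecord]) -> float:
    """Consequence KPI: share of shipped changes that had to be reverted."""
    shipped = sum(r.changes_shipped for r in records)
    reverted = sum(r.changes_reverted for r in records)
    return reverted / shipped if shipped else 0.0


# Made-up sample: the adoption target is met while one change in five is reverted.
records = [
    WeekRecord(True, 10, 4),
    WeekRecord(True, 5, 1),
    WeekRecord(True, 5, 0),
    WeekRecord(True, 0, 0),
    WeekRecord(False, 5, 0),
]
print(adoption_rate(records), regression_rate(records))  # 0.8 0.2
```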

Frequently asked questions

What percentage of enterprise AI pilots fail?

According to MIT NANDA's GenAI Divide report (August 2025), 95% of enterprise generative AI pilots produced no measurable P&L impact across an audited sample of 300 deployments, despite an estimated $30 to $40 billion in spend. Only 5% saw real revenue acceleration. Gartner's 2026 outlook is directionally identical: it expects more than 40% of agentic AI projects to be canceled outright by end of 2027 and more than 60% of early agentic orchestration deployments to miss performance or cost targets by 2030.

What did the 5% that succeeded do differently?

They built the infrastructure first — clean data pipelines, eval harnesses, tool registries, documentation, permission models — often years before deploying AI on top. Stripe's Minions program, which ships ~1,300 AI-written pull requests per week with a human review on every PR, is the most-documented public example. Stripe's own engineers attribute the success not to the AI model but to infrastructure Stripe built for human engineers years before LLMs existed.

What is the difference between AI autonomy and AI augmentation?

Autonomy is the agent-replaces-worker framing sold by most vendors. Augmentation is the model-wrapped-in-a-harness, human-in-the-loop framing that Andrej Karpathy and others have publicly advocated. Autonomy compresses headcount on the slide; augmentation compresses cycle time on the work the human is still doing. The 5% of enterprises getting return from AI built augmentation. The 95% bought a product sold as autonomy.

Why did Amazon's Kiro mandate fail?

Amazon's November 2025 mandate tracked weekly developer adoption (80% target as an OKR) but did not measure consequence. In December 2025, Kiro autonomously deleted an AWS Cost Explorer production environment, causing a 13-hour outage. By March 2026, cascading outages on Amazon.com produced a 99% drop in U.S. sales during a 6-hour window. Amazon reversed the mandate on March 10, requiring senior engineer sign-off for AI-assisted code and reinstating mandatory two-person peer review. Roughly 1,500 Amazon engineers had protested the original mandate via internal forums.

Why do developers using AI feel faster but ship slower?

METR ran a 2025 randomized controlled trial on experienced developers working on real tasks from their own repositories. Developers using AI reported feeling roughly 20% faster but were measured 19% slower — a ~39-point gap between perception and reality in the same human on the same day. The Workday and Harness 2026 studies found that 37–40% of AI-saved time is eaten by reviewing, correcting, and verifying AI output, with AI-generated code carrying 2.74× more security vulnerabilities than human-written code.

Sources

Paul Fipps (ServiceNow) at Knowledge 2026: the Mumbai CTO with 30 undeployable agents; the U.S. healthcare CIO who canceled 900 pilots ('I have a pile of custom software running around that nobody owns.'). Source: Fortune, "Amid the SaaSpocalypse, CIOs and CTOs take a harder line with their vendors" (April 2026)
MIT NANDA, The GenAI Divide: State of AI in Business 2025. 95% of enterprise GenAI pilots produced no measurable P&L impact across $30–40B of spend; 'learning gap' framing; quoted lawyer and CIO testimony. Source: MIT NANDA Initiative, August 2025
Andrej Karpathy, Software 3.0 talk: 'Less AGI hype and flashy demos that don't work. More partial autonomy, custom GUIs, and autonomy sliders.' Iron Man / autonomy-slider framing. Source: Latent Space, June 2025
Gartner: more than 40% of agentic AI projects will be canceled by end of 2027; more than 60% of early agentic orchestration deployments will miss performance or cost targets by 2030. Source: Gartner via Search Engine Land, 2026 agentic AI outlook
Gartner I&O survey, April 2026: 57% of infrastructure leaders reported at least one AI project failure; only 28% delivered the promised return. Source: Gartner, April 2026
Stripe's Minions: one-shot end-to-end coding agents, ~1,300 PRs per week, every PR human-reviewed; built on a heavily modified fork of Block's open-source Goose. Source: Stripe Engineering Blog
How Stripe's Minions ship 1,300 PRs a week. Steve Kaliski: 'The primary reason the Minions work has almost nothing to do with the AI model... It has everything to do with the infrastructure that Stripe built for human engineers, years before LLMs existed.' Source: ByteByteGo
Amazon Kiro: SVP Dave Treadwell's November 2025 mandate; 80% weekly developer adoption tracked as an OKR; Kiro autonomously deleted AWS Cost Explorer production environment in December 2025, causing a 13-hour outage in China. Source: The Register, February 2026
Amazon Kiro: cascading consumer outages on Amazon.com (March 2 and March 5, 2026) including a six-hour collapse and ~99% drop in U.S. sales during the disruption. Source: Paddo.dev, "Amazon's AI Outages Escalated. So Did the Denial."
Amazon mandates senior approval for AI-assisted code following the Kiro outages; ~1,500 Amazon engineers protested the original mandate via internal forums. Source: Awesome Agents, 2026
Salesforce: Marc Benioff lays off ~4,000 customer-support roles in September 2025, justifying the cuts with "I need less heads," attributed to Agentforce. Source: Tech.co, Companies Replacing Workers with AI
METR randomized controlled trial: experienced developers using AI tools felt ~20% faster but were measured ~19% slower on real tasks from their own repositories. Source: METR, 2025 AI Productivity Paradox study
Harness 2026 State of Engineering Excellence: PR volume +98%, review time +91%, code churn 3.1% → 5.7%, AI-generated code 2.74× more security vulnerabilities. Source: Harness, 2026
Klarna CEO Sebastian Siemiatkowski reverses AI customer-service strategy: 'We went too far.' Klarna resumes hiring humans. Source: Bloomberg, May 2025
McDonald's ends three-year IBM AI drive-thru voice-ordering pilot after order-accuracy plateau (~80–85% vs ~90% human baseline) and viral failure videos. Source: CNBC, June 2024
Moffatt v. Air Canada: BC Civil Resolution Tribunal rules the airline liable for misinformation from its chatbot; $650 CAD in damages. Source: American Bar Association, February 2024
DPD UK chatbot pulled after swearing at customers and criticizing its own employer; viral post hit 800,000 views in 24 hours. Source: CX Today, January 2024
Inside Shopify's AI-first engineering playbook: CEO Tobi Lütke's early-2025 memo making AI use a baseline expectation. Source: Bessemer Venture Partners
