Why so many enterprise AI projects blow up in production — and what has to exist underneath if you want to ship something a business can actually run on.
Stories so far: #1 — Claude Opus 4.7 says not to vibe code IDP · #2 — Exploratory agent takes down Salesforce · #3 — The token bill that ate the pilot
I’m in customer AI conversations multiple times per day. From the consultant seat — a MuleSoft and Salesforce delivery partner handling anything from SMB to the S&P 500, an AI-native firm that itself runs on Claude — the same failure pattern keeps repeating across the market, and the stories coming out of it are stacking up faster than I can write them.
The damage isn’t the rebuild bill. It’s the eighteen months of executive momentum that vaporize because the first attempt blew up in front of leadership. Companies don’t get many cracks at this.
That’s the pattern this series is going to dissect.
A buyer is told, sometimes by an internal “tech-savvy” stakeholder, sometimes by a sister tech company, occasionally by a buddy with a slick demo, that AI alone can replace what they’re already paying their integration platform and CRM to do. They wire a prototyping tool into a production system. It works in the demo. It fails in the field.
The whole AI ambition takes a credibility hit the technology didn’t deserve, and forward momentum is lost.
What’s actually missing when “AI alone” goes to prod
The honest answer to “should we use AI for this?” in the enterprise is almost never just AI. It’s a stack — and the specific products you pick matter less than the categories of capability that have to be filled underneath.
A few of them, in plain language:
Governance of data. What the model can see, what it can’t, where authorization lives, how PII gets masked, how a compliance category travels with a record through every stage of the pipeline. The model doesn’t enforce any of this on its own.
Quick iteration on agent instructions. Prompts, tool definitions, and policies change weekly. You need to version them, ship them, roll them back, and know which version made the decision you’re now defending in front of a customer.
Governance of workflows. Retries, error handling, batching, idempotency, observability, audit trail. Boring words; immense surface area. This is the thing standing between a useful agent and a status-page incident.
A real system of engagement. Where humans actually do the work, where exceptions get reviewed, and where outputs land. New agents don’t get to skip a decade of CRM hardening because they’re shiny.
A knowledge layer the agents can read and the humans can edit. The instructions, processes, decisions, and context the business runs on — kept somewhere coherent, not scattered across thirty docs and a Slack channel. AI-native operations come apart fast without one.
If you’re one person running a custom agent on your own machine, you can hand-wave most of this. The moment more than one person depends on the output, every one of those categories shows up as a bill — either paid in platform investment, or paid in incidents and trust erosion.
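To make one of those categories concrete, here is a minimal sketch of "know which version made the decision." It is an illustration under stated assumptions, not how any particular platform implements it: `PromptRegistry`, `record_decision`, and the in-memory audit log are hypothetical names, and a real deployment would persist both somewhere durable.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class PromptRegistry:
    """Hypothetical in-memory store of agent instruction versions."""
    versions: list = field(default_factory=list)  # ordered history of prompt texts
    active: int = -1                              # index of the live version

    def publish(self, text: str) -> str:
        """Append a new version, make it live, and return its identifier."""
        self.versions.append(text)
        self.active = len(self.versions) - 1
        return self.version_id(self.active)

    def rollback(self) -> str:
        """Re-activate the previous version without deleting any history."""
        if self.active <= 0:
            raise ValueError("no earlier version to roll back to")
        self.active -= 1
        return self.version_id(self.active)

    def version_id(self, index: int) -> str:
        # Content hash: the same text always maps to the same identifier.
        digest = hashlib.sha256(self.versions[index].encode("utf-8"))
        return digest.hexdigest()[:12]

    def current(self) -> tuple:
        """Return (version identifier, prompt text) for the live version."""
        return self.version_id(self.active), self.versions[self.active]


def record_decision(registry, audit_log, record_id, decision):
    """Stamp a decision with the prompt version that was live when it was made."""
    version, _ = registry.current()
    entry = {
        "record_id": record_id,
        "decision": decision,
        "prompt_version": version,
        "at": datetime.now(timezone.utc).isoformat(),
    }
    audit_log.append(entry)
    return entry
```

The design choice worth noticing is that the version identifier is written at decision time. When a customer disputes an outcome months later, the audit entry already says which instructions were live, even if those instructions have since been rolled back or replaced.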
What we use, and why we’re showing our work
I’ll name what we’ve built on, with the caveat that the categories matter more than the logos.
We run AI through Anthropic. We’ve been working closely with the model and I’ve personally been building C-Suite workflows on it for some time now. We’ve wired Claude into every step of how we deliver, building the practice around it, credentialing the team. There’s a bigger story behind that, and it’s coming soon.
We use MuleSoft for the integration and agentic workflow governance layer. It’s where AI stops being a science project and becomes operational software a business can rely on.
We use Salesforce as the system of engagement for customer-facing operations. Fifteen years of hardening doesn’t get rebuilt in a quarter, and we’d rather build on it than around it. Salesforce Headless 360 is a no-brainer for this.
We use Notion as the knowledge and operational system of record — where the processes, agents, and decisions that run the business live and stay coherent. It has earned a real seat at this table, particularly as the surface where AI-native operations actually get composed by the humans who run them.
Other platforms can play these roles. We have opinions, but the point isn’t the logos. The point is that the categories have to be filled — by someone — before you ship something a business can run on.
The stories
This is where each post in the series lives — short, redacted, and stacking up on this page as new ones land. What was tried, what went wrong, why it went wrong architecturally, what good looks like instead. Names scrubbed; the focus stays on the pattern and what lessons can be learned.
Story #1 — Claude Opus 4.7 says not to vibe code IDP
A buyer evaluating MuleSoft IDP gets pulled aside by a solo developer pitching the cheaper alternative: we’ll just home-grow it on Claude and Streamlit. I see this pattern every week. So I went and asked Claude itself.
The prompt was direct: “Someone is positioning Claude + Streamlit as a replacement for MuleSoft IDP. As Claude Opus 4.7, what do you think of this approach?”
Claude’s answer was the punchline:
“Streamlit isn’t an app platform — it’s a notebook with buttons. The Claude API is a building block, not a document processing product. My own opinion of building enterprise document workflows directly on me: don’t.”
The best available model from Anthropic, telling you not to do it and listing all the reasons why.
The demo always works. The hard 20% is invisible — no auth, no audit trail, no eval harness, no prompt versioning, no human-in-the-loop, no defensibility when a regulated output gets challenged, no trusted security model. By month two the customer owns a black box they can’t reason about.
This is exactly why Anthropic is standing up a Partner Program. The model is the brain. The platform is everything around it.
You’re not evaluating Claude vs. a platform like MuleSoft IDP. You’re evaluating one developer’s prototype vs. a governed platform that happens to use Claude or another model under the hood as an intelligence layer.
Read the original post on LinkedIn →
Story #2 — When an Exploratory Agent Took Down a Live Salesforce Org
A national legal-services firm processing high volumes of inbound documents wanted to automate the workflow with AI. Someone (an inside stakeholder, or a peer making a recommendation) proposed pointing a popular workflow-automation tool directly at their production Salesforce org. Spin up some flows. Fire up MCP. Agentify Salesforce just by plugging it in.
No doubt the demo worked. The first few weeks probably worked. Then volume picked up.
The agent didn’t know what an API rate limit was. The Salesforce org didn’t know how to defend itself from a caller that wouldn’t back off. Lights went out across the operation for the better part of a day. Records weren’t writing. The intake queue piled up. The escalation paths the org did have were rendered useless because the agent’s traffic was getting them throttled too.
Everything that wasn’t there is the same list as Story #1. No retry logic. No backoff with jitter. No circuit breaker. No idempotency. No audit trail of what the agent actually attempted and got refused on. No human-in-the-loop checkpoint for high-volume runs. No governance layer between an enthusiastic agent and the production system the business ran on. And no real plan outside of “agentify.”
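For readers who haven’t seen two of those missing pieces up close, here is a minimal sketch of backoff with jitter wrapped in a circuit breaker. Everything in it is an assumption for illustration: `RateLimited` stands in for whatever error a real client raises when the API refuses a call, and the thresholds are placeholders, not values from this incident. It does not cover idempotency or audit logging.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error for a refused call (for example, an HTTP 429)."""


class CircuitOpen(Exception):
    """Raised when the breaker is refusing calls to protect the downstream system."""


class CircuitBreaker:
    def __init__(self, max_failures=5, cooldown_s=60.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def allow(self) -> bool:
        if self.opened_at is None:
            return True
        # Once the cooldown has elapsed, allow trial calls again.
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # open, or re-open after a failed trial


def call_with_backoff(fn, breaker, max_attempts=5, base_s=0.5, cap_s=30.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on RateLimited with capped exponential backoff."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpen("downstream is cooling off; not sending more calls")
        try:
            result = fn()
        except RateLimited:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise
            # Full jitter: wait a random time in [0, min(cap, base * 2^attempt)).
            sleep(rng() * min(cap_s, base_s * 2 ** attempt))
        else:
            breaker.record_success()
            return result
```

The jitter keeps a fleet of retrying callers from hitting the API in lockstep, and the breaker is what makes the caller stop entirely instead of hammering a system that is already refusing it.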
The customer is now in the difficult conversation about what happened, and we’re being brought in — slowly, properly, with a real integration platform underneath — to do exactly what someone tried to skip three months ago. The rebuild is the easier part of the recovery.
Story #3 — The Token Bill That Ate the Pilot
You’ve probably seen the number floating around the AI partner ecosystem. A single AI-native company spends this on Anthropic:
$500,000,000
in a MONTH
Most operators hear that and assume it’s an only-at-the-frontier number that has nothing to say about mid-market enterprises.
The same horror story is playing out two zeros smaller, in customers I’m talking to every month. Token spend is the surface symptom. Governance absence is the disease.
The pattern. A buyer kicks off an AI pilot. An internal builder or a friendly vendor pitches the lean version: skip the integration platform, skip the orchestration layer, skip the prompt management tooling, skip the governance scaffolding the rest of the enterprise spent fifteen years installing around production systems. Just have the agent call the LLM directly from whatever workflow tool happens to be in the building.
Pilot one ships in a quarter. Pilot two adds a use case. Pilot three is where it goes sideways.
Here is the part nobody is pricing into their AI roadmap. The governance gap does not scale linearly with the number of legacy workflows you migrate to AI. It scales geometrically. One ungoverned agent is a finance question. Three ungoverned agents calling each other are a compliance question, an incident-response question, a data-classification question, and a cost-attribution question, all at the same time, in the same production environment, with no audit trail explaining how any of them got there.
The bill in the headline is the cost-attribution question made visible. The harder ones never make the headline. Who authorized the agent to read that record? Which version of the agent’s instructions handled the regulated decision a customer is now disputing? When the model gets upgraded next quarter, which workflows break and how would you even know?
The category list is the same as Story #1 and Story #2. Identity. Authorization. Audit. Prompt versioning. Budget enforcement. Concurrency caps. Observability. Human-in-the-loop. Data classification. Model routing. Boring names. The absence is what kills the AI ambition.
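Two items on that list, budget enforcement and concurrency caps, fit in a few lines, which is part of why their absence is so frustrating. The sketch below is a hypothetical illustration: `TokenBudget` is not a real library, the limits are placeholders, and `call` is assumed to return the result together with the tokens it actually consumed.

```python
import threading


class BudgetExceeded(Exception):
    """Raised before a call is made if it would push spend past the limit."""


class TokenBudget:
    """Hypothetical shared token budget with per-workflow cost attribution."""

    def __init__(self, token_limit: int, max_concurrent: int):
        self.limit = token_limit
        self.spent = {}        # workflow name -> tokens actually used
        self._reserved = 0     # tokens promised to calls still in flight
        self._lock = threading.Lock()
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def total_spent(self) -> int:
        return sum(self.spent.values())

    def run(self, workflow, estimated_tokens, call):
        """Run call() only if the estimate fits in the remaining budget."""
        with self._lock:
            committed = self.total_spent() + self._reserved
            if committed + estimated_tokens > self.limit:
                raise BudgetExceeded(f"{workflow}: would exceed {self.limit} tokens")
            # Reserve up front so concurrent callers cannot all pass the check.
            self._reserved += estimated_tokens
        used = 0
        try:
            with self._slots:  # blocks while max_concurrent calls are in flight
                result, used = call()
            return result
        finally:
            with self._lock:
                # Swap the reservation for the real usage (zero if call failed).
                self._reserved -= estimated_tokens
                self.spent[workflow] = self.spent.get(workflow, 0) + used
```

Because spend is recorded per workflow, the cost-attribution question has an answer before finance asks it, and the refusal happens before the model is called, not after the invoice arrives.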
The window to install this layer is before the second and third workflows go live. After that, governance becomes a rebuild — and the rebuild competes with every new AI initiative leadership wants to ship.
This is not an AI problem. It’s a governance problem. The token bill is just the first invoice.
More to come
This series isn’t dunking, isn’t a post-mortem for entertainment, isn’t naming-and-shaming. Most of the people making these decisions are smart and working with incomplete information. They deserve better than a takedown.
If you’re navigating these decisions — as a buyer, a builder, or anyone watching this market shake out — bookmark this URL. New stories land here. The horror stories are stacking up faster than I can write them.
