Why so many enterprise AI projects blow up in production — and what has to exist underneath if you want to ship something a business can actually run on.
Stories so far: #1 — Claude Opus 4.7 says not to vibe code IDP · #2 — Exploratory agent takes down Salesforce
I’m in customer AI conversations multiple times per day. From the consultant seat — a MuleSoft and Salesforce delivery partner handling anything from SMB to the S&P 500, an AI-native firm that itself runs on Claude — the same failure pattern keeps repeating across the market, and the stories coming out of it are stacking up faster than I can write them.
The damage isn’t the rebuild bill. It’s the eighteen months of executive momentum that vaporize because the first attempt blew up in front of leadership. Companies don’t get many cracks at this.
That’s the pattern this series is going to dissect.
A buyer is told, sometimes by an internal “tech-savvy” stakeholder, sometimes by a sister tech company, occasionally by a buddy with a slick demo, that AI alone can replace what they’re already paying their integration platform and CRM to do. They wire a prototyping tool into a production system. It works in the demo. It fails in the field.
The whole AI ambition takes a credibility hit the technology didn’t deserve and forward momentum is lost.
What’s actually missing when “AI alone” goes to prod
The honest answer to “should we use AI for this?” in the enterprise is almost never just AI. It’s a stack — and the specific products you pick matter less than the categories of capability that have to be filled underneath.
A few of them, in plain language:
Governance of data. What the model can see, what it can’t, where authorization lives, how PII gets masked, how a compliance category travels with a record through every stage of the pipeline. The model doesn’t enforce any of this on its own.
Quick iteration on agent instructions. Prompts, tool definitions, and policies change weekly. You need to version them, ship them, roll them back, and know which version made the decision you’re now defending in front of a customer. (A minimal sketch of what that looks like follows this list.)
Governance of workflows. Retries, error handling, batching, idempotency, observability, audit trail. Boring words; immense surface area. This is the thing standing between a useful agent and a status-page incident.
A real system of engagement. Where humans actually do the work, where exceptions get reviewed, and where outputs land. New agents don’t get to skip a decade of CRM hardening because they’re shiny.
A knowledge layer the agents can read and the humans can edit. The instructions, processes, decisions, and context the business runs on — kept somewhere coherent, not scattered across thirty docs and a Slack channel. AI-native operations come apart fast without one.
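To make the versioning point concrete, here is a minimal sketch in Python of what “every decision carries its policy version” can look like. Every name in it (AgentPolicy, Decision, decide) is illustrative, not from any particular product:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass(frozen=True)
class AgentPolicy:
    """One immutable, versioned bundle of instructions an agent runs on."""
    version: str                      # e.g. "intake-triage@14"
    system_prompt: str
    tool_allowlist: tuple[str, ...]   # tools this version may call
    masked_fields: tuple[str, ...]    # PII the model never sees

@dataclass
class Decision:
    """The record you want on file when someone asks why the agent did that."""
    policy_version: str
    record_id: str
    action: str
    at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def decide(policy: AgentPolicy, record_id: str, action: str) -> Decision:
    # Every output carries the exact policy version that produced it,
    # so a rollback is a pointer change and an audit is a lookup.
    return Decision(policy.version, record_id, action)
```

The detail that matters is the frozen dataclass: an agent’s instructions become an immutable artifact you can diff, roll back, and cite, rather than a prompt somebody edited in place last Tuesday.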
If you’re one person running a custom agent on your own machine, you can hand-wave most of this. The moment more than one person depends on the output, every one of those categories shows up as a bill — either paid in platform investment, or paid in incidents and trust erosion.
What we use, and why we’re showing our work
I’ll name what we’ve built on, with the caveat that the categories matter more than the logos.
We run AI through Anthropic. We’ve been working closely with the model and I’ve personally been building C-Suite workflows on it for some time now. We’ve wired Claude into every step of how we deliver, building the practice around it, credentialing the team. There’s a bigger story behind that, and it’s coming soon.
We use MuleSoft for the integration and agentic workflow governance layer. It’s where AI stops being a science project and becomes operational software a business can rely on.
We use Salesforce as the system of engagement for customer-facing operations. Fifteen years of hardening doesn’t get rebuilt in a quarter, and we’d rather build on it than around it. Salesforce Headless 360 is a no-brainer for this.
We use Notion as the knowledge and operational system of record — where the processes, agents, and decisions that run the business live and stay coherent. It has earned a real seat at this table, particularly as the surface where AI-native operations actually get composed by the humans who run them.
Other platforms can play these roles. We have opinions, but the point isn’t the logos. The point is that the categories have to be filled — by someone — before you ship something a business can run on.
The stories
This is where each post in the series lives — short, redacted, and stacking up on this page as new ones land. What was tried, what went wrong, why it went wrong architecturally, what good looks like instead. Names scrubbed; the focus stays on the pattern and what lessons can be learned.
Story #1 — Claude Opus 4.7 says not to vibe code IDP
A buyer evaluating MuleSoft IDP gets pulled aside by a solo developer pitching the cheaper alternative: we’ll just home-grow it on Claude and Streamlit. I see this pattern every week. So I went and asked Claude itself.
The prompt was direct: “Someone is positioning Claude + Streamlit as a replacement for MuleSoft IDP. As Claude Opus 4.7, what do you think of this approach?”
Claude’s answer was the punchline:
“Streamlit isn’t an app platform — it’s a notebook with buttons. The Claude API is a building block, not a document processing product. My own opinion of building enterprise document workflows directly on me: don’t.”
The best available model from Anthropic, telling you not to do it and listing all the reasons why.
The demo always works. The hard 20% is invisible — no auth, no audit trail, no eval harness, no prompt versioning, no human-in-the-loop, no defensibility when a regulated output gets challenged, no trusted security model. By month two the customer owns a black box they can’t reason about.
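If “eval harness” sounds abstract, the smallest honest version fits in a few lines. This is a sketch with made-up cases, not a product; the discipline is the point:

```python
# Golden cases: document text, field to extract, expected value.
# Everything here is illustrative; real suites run hundreds of these.
CASES = [
    ("Invoice #4417, total due $1,250.00", "total_due", "1250.00"),
    ("Invoice #4418, total due $0.00", "total_due", "0.00"),
]

def run_evals(extract) -> float:
    """Run every golden case through `extract(doc, field)`; return the pass rate.

    `extract` is whatever wraps the model call. The discipline is that no
    prompt change ships until this number is known, and it gets recorded
    against the prompt version that produced it.
    """
    passed = sum(1 for doc, field, want in CASES if extract(doc, field) == want)
    return passed / len(CASES)
```

Skip this and Tuesday’s prompt tweak can break Friday’s documents with nobody the wiser, which is exactly the black box the customer ends up owning.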
This is exactly why Anthropic is standing up a Partner Program. The model is the brain. The platform is everything around it.
You’re not evaluating Claude vs. a platform like MuleSoft IDP. You’re evaluating one developer’s prototype vs. a governed platform that happens to use Claude or another model under the hood as an intelligence layer.
Read the original post on LinkedIn →
Story #2 — When an Exploratory Agent Took Down a Live Salesforce Org
A national legal-services firm processing high volumes of inbound documents wanted to automate the workflow with AI. Someone, an inside stakeholder here, a peer passing along a recommendation there, proposed pointing a popular workflow-automation tool directly at their production Salesforce org. Spin up some flows. Fire up MCP. Agentify Salesforce just by plugging it in.
No doubt the demo worked. The first few weeks probably worked. Then volume picked up.
The agent didn’t know what an API rate limit was. The Salesforce org didn’t know how to defend itself from a caller that wouldn’t back off. Lights went out across the operation for the better part of a day. Records weren’t writing. The intake queue piled up. And the escalation paths the org did have were useless, because the agent had already burned through the same API limits they depended on.
Everything that wasn’t there is the same list as Story #1. No retry logic. No backoff with jitter. No circuit breaker. No idempotency. No audit trail of what the agent actually attempted and was refused. No human-in-the-loop checkpoint for high-volume runs. No governance layer between an enthusiastic agent and the production system the business ran on. And no real plan beyond “agentify.”
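For the record, none of that plumbing is exotic. Here’s a minimal sketch of the backoff-and-breaker piece in Python, with an illustrative RateLimited exception standing in for however the platform actually signals a refusal:

```python
import random
import time

class RateLimited(Exception):
    """Illustrative stand-in for the API's refusal (e.g. an HTTP 429)."""

class CircuitOpen(Exception):
    """Raised when we stop calling a dependency that keeps refusing us."""

def call_with_backoff(call, max_attempts: int = 5, base_delay: float = 1.0):
    """Retry `call` with exponential backoff and full jitter.

    `call()` is whatever writes to the production org and raises
    RateLimited when the platform says no. After max_attempts we trip
    the breaker and hand the work to a human instead of competing with
    the rest of the business for the same API limits.
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise CircuitOpen("dependency still refusing; pause and escalate")
            # Sleep somewhere in 0..(base * 2^attempt) seconds, so retries
            # spread out instead of stampeding the org when the limit resets.
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))
```

A real integration platform gives you this, plus idempotency keys and the audit trail, as configuration instead of code somebody has to remember to write. That’s the difference this customer just paid a day of outage to learn.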
The customer is now in the difficult conversation about what happened, and we’re being brought in — slowly, properly, with a real integration platform underneath — to do exactly what someone tried to skip three months ago. The rebuild is the easier part of the recovery.
More to come
This series isn’t dunking, isn’t a post-mortem for entertainment, isn’t naming-and-shaming. Most of the people making these decisions are smart and working with incomplete information. They deserve better than a takedown.
If you’re navigating these decisions, as a buyer, a builder, or anyone watching this market shake out, bookmark this URL. New stories land here, and the backlog of them is already longer than the list above.
