Five engineering reasons AI agents fail in production
Last updated: May 2026
About 97% of executives report deploying AI agents over the past year, but only 12% of those initiatives reach production at scale (Composio AI Agent Report 2025). The gap is an engineering problem, and the failure modes are predictable. This post walks through the five we see most often, the case studies where we ran into them, and the patterns that fix them. It's the postmortem companion to our pillar on agentic AI workflow services.
Key takeaways
- The failure rate is structural, not a model problem. McKinsey finds that 60% of AI projects fail because of workflow misalignment and governance gaps, not model performance.
- Three patterns dominate the failure data: poor retrieval (dumb RAG), broken tool integrations (brittle connectors), and errors that compound across multi-step plans (Shakudo, 2026). Governance gaps and context engineering misses show up later.
- Tool calling fails between 3 and 15 percent of the time in production (Arize, 2026). Plan for retries, schema validation, and a human-review surface for the long tail.
- The engineering response is consistent across failure modes: observable, evaluated, governed, recoverable. Treat agents like services, not prompts.
1. Dumb RAG: retrieval that demos well and breaks under real queries
The failure mode. Retrieval-augmented generation gets built once, against a curated dataset, with no evaluation pipeline. The system pulls whatever the embedding model returns, feeds it to the LLM, and trusts the answer. It demos beautifully. Then real users arrive with real queries and the agent surfaces wrong products, irrelevant documents, or stale context. The failure is silent — the agent confidently produces an answer that happens to be grounded in the wrong source.
Where we saw it. Any retrieval-heavy build is at risk. On the conversational commerce engagement we ran for a global sportswear brand, the question wasn't "does retrieval work" but "does it return the right product when a shopper asks in their own language." Pinecone returned semantically similar matches; the question was whether that matched buyer intent.
The engineering response. Hybrid retrieval (vector plus keyword plus filters) almost always beats pure vector search for commercial queries. Chunking gets evaluated, not assumed. Every retrieval gets logged with the query, the chunks returned, and a relevance score from an LLM-as-a-judge evaluator. The eval pipeline becomes the gate: a retrieval change ships when the golden-dataset score holds and the LLM-judge score on live traffic does not regress. The fix is not a smarter embedding model — it's treating retrieval as a system you measure, not a library you call.
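The shipping gate described above can be sketched in a few lines. This is a minimal illustration rather than our production pipeline: the names `RetrievalScores` and `can_ship`, the 0.01 tolerance, and the scores themselves are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalScores:
    golden: float      # mean relevance on the golden dataset, 0..1
    live_judge: float  # mean LLM-judge relevance on sampled live traffic, 0..1


def can_ship(baseline: RetrievalScores, candidate: RetrievalScores,
             tolerance: float = 0.01) -> bool:
    """Return True only if neither score drops by more than `tolerance`."""
    golden_holds = candidate.golden >= baseline.golden - tolerance
    live_holds = candidate.live_judge >= baseline.live_judge - tolerance
    return golden_holds and live_holds


# Illustrative numbers only: a change that improves the golden-dataset score
# but regresses on live traffic is blocked.
baseline = RetrievalScores(golden=0.82, live_judge=0.76)
better_on_golden_only = RetrievalScores(golden=0.88, live_judge=0.70)
print(can_ship(baseline, better_on_golden_only))  # False
```

Requiring both checks to pass is the point: a retrieval change that only looks good on the curated set is exactly the one that demos well and breaks under real queries.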
2. Brittle connectors: the integration-layer tax
The failure mode. The agent can plan but cannot act. The tools it needs live behind APIs nobody documented, schemas drift without notice, and a single failed call propagates into a wrong answer because nothing validates what came back. Tool calling fails between 3 and 15 percent of the time in production (Arize, 2026). At ten tool calls per task, and assuming calls fail independently, roughly 26 to 80 percent of tasks hit at least one failed call if nothing handles them.
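The compounding is easy to check. The sketch below assumes that tool calls fail independently and that one unhandled failure sinks the whole task; both are simplifications, and `task_failure_rate` is a name made up for the example.

```python
def task_failure_rate(per_call_failure: float, calls_per_task: int) -> float:
    """Probability that at least one of `calls_per_task` independent calls fails."""
    return 1 - (1 - per_call_failure) ** calls_per_task


for p in (0.03, 0.15):
    print(f"{p:.0%} per call -> {task_failure_rate(p, 10):.0%} per task")
# 3% per call -> 26% per task
# 15% per call -> 80% per task
```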
Where we saw it. The Aralab invoice automation agent is a Claude Sonnet 4.5 agent that reads manufacturing invoices and writes back to GCP, Firebase, and the manufacturing finance system. The agent was the easier half; the harder half was making the integration layer survive real-world messiness: malformed invoices, intermittent third-party outages, schema drift. Without that discipline, the agent shipped a clean demo and quietly mishandled the long-tail invoices that don't fit the happy path.
The engineering response. Wrap every tool with an explicit interface, not raw API calls. Validate inputs and outputs against schemas. Retry idempotent calls with backoff and a circuit breaker. Route anything that doesn't validate into a structured human-review queue rather than letting the agent invent a plausible answer. Log every tool call and every fallback as a first-class trace event. The unsexy framing matters: agents that work in production look more like reliable distributed systems than like prompt engineering.
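A wrapper along these lines might look like the sketch below. It is not the Aralab implementation: `call_tool`, `ReviewQueue`, `SchemaError`, and the dictionary-based schema check are hypothetical stand-ins, the wrapped tool is assumed to be idempotent, and the circuit breaker and trace logging are left out to keep the example short.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

Payload = dict[str, Any]


class SchemaError(ValueError):
    """Raised when a payload is missing a key or has the wrong type."""


def validate(payload: Payload, schema: dict[str, type]) -> None:
    # Deliberately tiny check; a real build would likely use a schema library.
    for key, expected in schema.items():
        if not isinstance(payload.get(key), expected):
            raise SchemaError(f"{key!r} is missing or not {expected.__name__}")


@dataclass
class ReviewQueue:
    items: list[Payload] = field(default_factory=list)

    def put(self, item: Payload) -> None:
        self.items.append(item)


def call_tool(
    tool: Callable[[Payload], Payload],
    args: Payload,
    input_schema: dict[str, type],
    output_schema: dict[str, type],
    queue: ReviewQueue,
    retries: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Payload]:
    """Validate, call, retry transient failures, and queue whatever still fails."""
    error = "no attempt made"
    try:
        validate(args, input_schema)
        for attempt in range(retries):
            try:
                result = tool(args)
            except ConnectionError as exc:  # transient: back off, then retry
                error = f"transient: {exc}"
                if attempt < retries - 1:
                    sleep(base_delay * 2 ** attempt)
                continue
            validate(result, output_schema)  # malformed output is not retried
            return result
    except SchemaError as exc:
        error = f"schema: {exc}"
    # Nothing usable came back: hand the case to a human instead of guessing.
    queue.put({"args": args, "error": error})
    return None


queue = ReviewQueue()
bad = call_tool(
    lambda a: {"total": "12.50"},  # total arrives as a string, not a float
    {"invoice_id": "INV-1"},
    {"invoice_id": str},
    {"total": float},
    queue,
    sleep=lambda s: None,
)
print(bad, len(queue.items))  # None 1
```

The design choice worth noting is that the wrapper returns `None` and queues the case rather than passing an unvalidated payload back to the agent, so the agent never gets the chance to build an answer on it.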
3. Compounding errors in multi-step plans
The failure mode. The agent makes a small mistake on step one. Step two builds on step one. Step three builds on step two. By step five, the agent is confidently executing the wrong plan because nothing along the way questioned whether step one was right. The pattern is particularly vicious in multiagent systems, where each agent treats the prior agent's output as ground truth.
Where we saw it. The PepTalk agentic operations system runs as a multiagent architecture on AWS Bedrock with pgvector for retrieval and Argilla for evaluation. Multiple specialised agents handle different parts of the speaker-engagement workflow. The early version produced impressive whole-pipeline demos and occasional cliff-edge failures: when one agent misclassified an intent at the top of the chain, the downstream agents executed the wrong workflow with full confidence.
The engineering response. Explicit completion signals, not implicit ones. Each agent declares what success looks like for its step before the orchestrator advances. Validated handoffs: the orchestrator inspects the output schema and the confidence signal, and routes to a human if either is below threshold. Central state instead of distributed state, so the audit trail is one trace, not five overlapping ones. Golden-dataset evaluation runs the whole pipeline, not just individual agents in isolation. In a multiagent system, the orchestrator's job is to be sceptical of every agent it manages.
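A validated handoff can be as small as the sketch below. `StepResult`, `route`, the required keys, and the 0.8 confidence threshold are illustrative assumptions, not the PepTalk orchestrator's actual interface.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class StepResult:
    agent: str
    output: dict[str, Any]
    confidence: float  # 0..1, however the agent reports it
    done: bool         # explicit completion signal from the agent


def route(result: StepResult, required_keys: set[str],
          min_confidence: float = 0.8) -> str:
    """Decide whether the orchestrator advances or escalates to a human."""
    if not result.done:
        return "human_review"
    missing = required_keys - set(result.output)
    if missing or result.confidence < min_confidence:
        return "human_review"
    return "advance"


shaky = StepResult(
    agent="intent_classifier",
    output={"intent": "book_speaker"},
    confidence=0.55,
    done=True,
)
print(route(shaky, required_keys={"intent"}))  # human_review
```

The output here is well-formed and the agent says it is finished, but the low confidence alone stops the chain, which is the case that otherwise turns into a confidently executed wrong workflow.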
4. Governance gaps: the second-day problem
The failure mode. The pilot ships. It works. Six months later, an audit lands and the team can't answer basic questions: which model version produced this decision, what data did it see, who approved this prompt change, was there human oversight when the agent made this irreversible call. The agent didn't fail technically. It failed the governance test, and the result is the same — pulled from production, programme cancelled, trust lost.
Where we saw it. Regulated industries make this failure mode obvious early. The AI communications assistant we built for a UK water utility operates in a sector where every customer-facing communication is auditable. We couldn't ship a working agent. We had to ship a working agent whose every decision was traceable to an input, a model version, a prompt, and a tool call, and where a human could intervene before any outbound communication landed.
The engineering response. Build governance into the agent, not next to it. Tool allowlists declared at config time and enforced at runtime. Output schemas the LLM has to fit into before any outbound action. Golden-dataset evaluation in CI on every prompt or model change. Full trace capture through OpenTelemetry into Langfuse or an equivalent. An agent registry the team can query: which agents are running, what tools they can call, who owns them. Audit logs designed from the start to satisfy SOC 2 and the EU AI Act's August 2026 high-risk obligations. The EU AI Act is not adding a new failure mode; it's making an existing one legally enforceable.
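As one concrete piece of this, an allowlist declared at config time and enforced at runtime might look like the following sketch. `GovernedToolbox`, `ToolNotAllowed`, and the two tool names are hypothetical, and the in-memory `audit_log` stands in for whatever trace store a real build would use.

```python
from typing import Any, Callable


class ToolNotAllowed(PermissionError):
    """Raised when an agent requests a tool outside its declared allowlist."""


class GovernedToolbox:
    def __init__(self, agent: str, allowlist: frozenset[str],
                 tools: dict[str, Callable[..., Any]]) -> None:
        self.agent = agent
        self.allowlist = allowlist
        self.tools = tools
        self.audit_log: list[dict[str, Any]] = []

    def call(self, tool_name: str, **kwargs: Any) -> Any:
        allowed = tool_name in self.allowlist
        # Record the attempt before acting, so denied calls are auditable too.
        self.audit_log.append(
            {"agent": self.agent, "tool": tool_name, "args": kwargs, "allowed": allowed}
        )
        if not allowed:
            raise ToolNotAllowed(f"{self.agent} may not call {tool_name}")
        return self.tools[tool_name](**kwargs)


toolbox = GovernedToolbox(
    agent="comms-assistant",
    allowlist=frozenset({"draft_message"}),
    tools={
        "draft_message": lambda text: f"DRAFT: {text}",
        "send_message": lambda text: f"SENT: {text}",
    },
)
print(toolbox.call("draft_message", text="Planned maintenance on Tuesday"))
try:
    toolbox.call("send_message", text="Planned maintenance on Tuesday")
except ToolNotAllowed as exc:
    print(exc)  # comms-assistant may not call send_message
```

Because the check lives in the call path rather than in the prompt, the agent cannot talk its way past it, and the denied attempt still lands in the audit trail.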
5. Context engineering: the silent half of agent reliability
The failure mode. The agent has the right tools, governance, and data, and still produces wrong answers because the context it sees at decision time is incomplete, stale, or contradictory. Conversation history pollutes the prompt. Memory leaks across user sessions. Integration tests pass; the failure is in what the agent knows at the moment it acts.
Where we saw it. The SANA Hotels AI training platform personalises training delivery using AI avatars and Anthropic Claude. The agent has to track each learner's progress, adapt to their level, and avoid repeating content they've mastered. The early failure mode was not a model problem: the agent occasionally treated a previous learner's context as the current learner's, because session boundaries weren't enforced rigorously enough. The agent was working as designed; the design was wrong about what context belonged to whom.
The engineering response. Context-isolated tasks: every interaction starts from an explicit, scoped context object, not from accumulated conversation. Managed conversation history with deliberate truncation, not whatever fits in the window. Prompt versioning so changes can be rolled back when a quality regression appears. And the part most teams skip: eval datasets that include adversarial cases for context contamination, so the test suite actually exercises the failure mode. This is where senior engineering judgment matters most, because the failure modes don't appear in unit tests.
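A scoped context object, and the kind of contamination test that exercises it, can be sketched as follows. `Turn`, `ScopedContext`, `build_context`, and the six-turn limit are assumptions for illustration, not the SANA Hotels implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    session_id: str
    role: str
    text: str


@dataclass(frozen=True)
class ScopedContext:
    session_id: str
    learner_id: str
    turns: tuple[Turn, ...]


def build_context(session_id: str, learner_id: str, history: list[Turn],
                  max_turns: int = 6) -> ScopedContext:
    """Keep only this session's turns, truncated to the most recent `max_turns`."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    own = [turn for turn in history if turn.session_id == session_id]
    return ScopedContext(session_id, learner_id, tuple(own[-max_turns:]))


# Adversarial case: another learner's turn is interleaved in the shared history.
history = [
    Turn("s-1", "user", "I finished the check-in module."),
    Turn("s-2", "user", "I have not started yet."),
    Turn("s-1", "assistant", "Great, moving on to housekeeping."),
]
context = build_context("s-2", "learner-42", history)
print([turn.text for turn in context.turns])  # ['I have not started yet.']
```

Filtering by session before truncating matters: truncating first would let another session's recent turns push the current learner's own history out of the window.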
What the five failures have in common
The engineering response converges on four properties. A production agent is observable (structured traces of every decision, tool call, and output), evaluated (golden datasets in CI plus online quality scoring plus human review), governed (tool allowlists, output schemas, audit logs, agent registry), and recoverable (retries, validation gates, no silently compounding errors). The failure modes vary by case study and industry; the engineering response does not. A team inheriting the system needs these patterns to be obvious, not buried.