The demo worked. Everyone in the room was impressed. Three weeks later, the agent was quietly retired and the team went back to doing it manually.

Sound familiar?

Here's what nobody mentions when they sell you on AI agents: model failures are rare. Most agents don't fail because they're stupid. They fail because the environment they're running in wasn't built for them. The gap between "this AI can do that" and "this AI does that consistently, without supervision" is a plumbing problem, not a model problem.

We hit all three failure modes building the receipt intake workflow for Housewire. In testing, the agent processed expenses, categorized transactions, and surfaced anomalies. In production, it answered with high confidence from the wrong context, fell apart on edge cases nobody anticipated, and checked for new receipts every hour on a cron job. None of it was a model problem. All of it was plumbing.

The field's default assumption is that reliability improves when you improve the model. Better model, better prompt, different provider. SI's position is the opposite: the model is rarely the constraint. The environment is. And it fails in three specific ways.

Dumb RAG

Most teams think about retrieval like this: give the agent access to the knowledge base and it'll figure out what it needs. That works in a demo. In production, it creates context rot.

Accuracy degrades when you load too much into an agent's working context: entire databases, full document libraries, unstructured dumps. Not because the model can't handle it. Because the signal gets buried. The agent keeps answering, with high confidence, from the wrong part of the context.

The pattern that works in production: just-in-time retrieval. The agent gets structured access to domain knowledge and pulls exactly what the task requires. The benchmark from teams running this at scale: keep working context under 8,000 tokens for task-execution agents. That number forces precision about what goes in.
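
A minimal sketch of that pattern in Python. The `search_receipts` retrieval function and the four-characters-per-token estimate are illustrative assumptions, not any specific library's API:

```python
MAX_CONTEXT_TOKENS = 8_000  # the production benchmark cited above

def estimate_tokens(text: str) -> int:
    # Rough heuristic: roughly four characters per token for English text.
    return len(text) // 4

def build_task_context(task_query: str, search_receipts) -> str:
    """Just-in-time retrieval: pull only what this task requires,
    most relevant first, and stop before blowing the context budget."""
    results = search_receipts(query=task_query, limit=20)  # scoped, ranked
    parts: list[str] = []
    used = 0
    for doc in results:
        cost = estimate_tokens(doc)
        if used + cost > MAX_CONTEXT_TOKENS:
            break  # the budget forces precision about what goes in
        parts.append(doc)
        used += cost
    return "\n\n".join(parts)
```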

This is an information architecture problem, not an AI problem.

Brittle Connectors

Most software isn't built for agents. It's built for humans, which means APIs designed for web UIs, undocumented field behavior, rate limits that made sense for human-paced interactions, and custom field semantics that live in one senior developer's head.

Point an agent at that surface and you're asking it to navigate terrain it wasn't designed for. The result: the agent works when everything goes right and falls apart the moment it hits an edge case nobody anticipated.

Agent-ready integration surfaces look different. Workflow-level tools where one tool handles one complete operation, not one API call. Descriptions that explain purpose and consequences, not just parameters. Error messages in plain language. "The expense category doesn't match any active source rules" is actionable. "Error 422: Unprocessable Entity" is not.
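
A sketch of what that looks like as a tool definition, in the JSON-schema style most agent frameworks accept. The tool name, fields, and error text are illustrative, not a real product's schema:

```python
# One workflow-level tool: one complete operation, not one raw API call.
CATEGORIZE_RECEIPT_TOOL = {
    "name": "categorize_receipt",
    "description": (
        "Assign an expense category to a parsed receipt and post it to the "
        "ledger. If no active source rule matches the category, the receipt "
        "is left uncategorized and flagged for human review."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "receipt_id": {"type": "string", "description": "ID of the parsed receipt"},
            "category": {"type": "string", "description": "Expense category to assign"},
        },
        "required": ["receipt_id", "category"],
    },
}

def categorize_receipt(receipt_id: str, category: str, active_rules: set[str]) -> dict:
    # Errors in plain language the agent can act on, not bare HTTP codes.
    if category not in active_rules:
        return {
            "ok": False,
            "error": (
                f"The expense category '{category}' doesn't match any active "
                "source rules. Ask for the active rules or flag this receipt "
                "for human review."
            ),
        }
    # ...post to ledger, update status, emit audit record: one operation...
    return {"ok": True, "receipt_id": receipt_id, "category": category}
```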

We built the receipt intake workflow for Housewire twice. The first design exposed raw database operations to the agent. The second defined workflow-level operations. The difference in reliability wasn't incremental. It was categorical.
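
For a sense of what categorical means here, this is roughly the shape of the two tool surfaces, reconstructed for illustration with hypothetical names:

```python
# First design: raw database operations. The agent had to chain them
# correctly, in order, on every single run.
RAW_TOOLS = ["db.select", "db.insert", "db.update", "db.delete"]

# Second design: workflow-level operations, one tool per complete task.
WORKFLOW_TOOLS = [
    "ingest_receipt",      # parse, validate, and store in one call
    "categorize_receipt",  # match rules, post to ledger, flag mismatches
    "flag_anomaly",        # surface an outlier with a plain-language reason
]
```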

The Polling Tax

Autonomous agents need to react to things that happen in your system. New record created. Status changed. Threshold crossed. The default approach is polling: check for new items every N minutes.

It seems reasonable. It's quietly catastrophic.

You can't run autonomous agents on request-response infrastructure. Polling introduces latency, burns compute on empty checks, and converts what should be an event-driven process into a slow batch job. An agent checking for new receipts every hour isn't an autonomous system. It's a scheduled job with extra steps, and it's the most common reason implementations feel unreliable without anyone understanding why.

The fix is straightforward: agents acting on real records need data that's less than five minutes stale. That means webhooks, realtime subscriptions, and state-change triggers, not cron jobs asking whether anything changed. This is a solved infrastructure problem. The hard part is recognizing your polling architecture is the bottleneck before you've spent six months debugging the wrong thing.
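
A sketch of the two shapes side by side. The `enqueue_agent_run` hand-off and the payload fields are hypothetical stand-ins for whatever runs your agent:

```python
import time

def enqueue_agent_run(task: str, record_id: str) -> None:
    # Hypothetical hand-off to your agent runtime (queue, job runner, etc.).
    print(f"agent run queued: {task} for {record_id}")

def handle_receipt_webhook(payload: dict) -> None:
    """Event-driven: called by your HTTP layer the moment the source
    system fires a 'receipt.created' event. Staleness: seconds."""
    if payload.get("event") == "receipt.created":
        enqueue_agent_run(task="process_receipt", record_id=payload["receipt_id"])

def poll_for_receipts(fetch_new_receipts) -> None:
    """The anti-pattern, for contrast: a scheduled job with extra steps."""
    while True:
        for receipt in fetch_new_receipts():  # usually returns nothing
            enqueue_agent_run(task="process_receipt", record_id=receipt["id"])
        time.sleep(3600)  # up to an hour of guaranteed staleness
```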

Three questions that tell you more than any benchmark

If you're building or running an AI agent integration, start here:

  1. What does your agent load into context, and what happens to it as your knowledge base grows?

  2. Are the tools available to your agent workflow-level, or are they raw API calls it has to chain together?

  3. When something changes in your system, does the agent find out through a webhook — or by asking?

Most implementations fail question two or three on contact. Fixing them isn't glamorous. It's wiring work. But it's the wiring that decides whether your agent runs in production or gets quietly retired.

The demo always works. What you're actually evaluating is the plumbing.

Which of the three failure modes have you hit? Reply with the one that cost you the most time — I'm compiling patterns.

The plumbing is fixed. The context window still has a problem most teams have never heard named. Next issue: why retrieval isn't the layer you need to fix — and what is.

Systems Intelligence is the honest trade publication for AI automation practitioners. No demos. No hype. Just what it actually takes to make AI reliable in production. Subscribe at systemsintel.dev.
