
How to Build an AI Agent That Actually Works in Production

Agentic Runbook ·

The demo always works. You chain a few prompts together, hook up a tool call or two, and watch the agent reason its way through a task that would have taken a human twenty minutes. Everyone in the room is impressed. You ship it to production.

Ninety days later, it’s off. Users have found the edge cases your test cases didn’t cover. The LLM is making confident decisions based on stale context. Someone on your team disabled a guardrail to unblock a deadline and never turned it back on. The agent is still running — it just isn’t working.

This is the most common trajectory for production AI agents in 2025. The gap isn’t between companies that have good engineers and companies that don’t. It’s between teams who understood the production failure modes before they built and teams who didn’t.

Here’s what the teams who ship agents that actually last do differently.


The 4 Failure Modes That Kill Production Agents

Before the build process, you need to understand what you’re building against. These four failure modes account for the majority of production agent collapses we see in mid-market orgs.

1. No Eval Loop

The agent was tested informally during development — a few manual runs, everything looked good. But there’s no automated evaluation running in CI, no regression suite, and no way to detect when model updates or changing data distributions quietly degrade performance. You find out the agent broke when a user complains.

Evaluation isn’t a launch checklist item. It’s the thing that tells you your system is working. Without it, you’re flying blind in production.

2. No Memory Architecture

The agent was built assuming each session is stateless. Works fine in a demo. In production, the tasks that matter almost always require context that spans multiple interactions: what did this user ask for last week, what state did this process leave off in, what decisions were made in the previous run?

Without a deliberate memory model, agents either bloat the context window trying to stuff in everything (expensive, slow, and eventually impossible) or lose track of state entirely (unreliable and user-hostile). Both kill trust fast.

3. Over-Reliance on a Single LLM

The team picked one model, tuned prompts against it, and shipped. Then the model provider pushed an update. Or the task volume grew and costs became unsustainable. Or a specific task category — classification, extraction, generation — needs different capability tradeoffs than the general-purpose model provides.

Production agents are systems, not model wrappers. Treating a single LLM as the entire architecture creates fragile dependencies and cost structures that don’t scale.

4. No Human-in-the-Loop Escape Hatch

The agent handles the easy cases fine. But what happens when it hits something it shouldn’t decide on its own — an edge case it wasn’t designed for, a high-stakes action with irreversible consequences, a situation where confidence is low? If the answer is “it tries anyway,” that’s a problem.

Every production agent needs a defined escalation path: explicit conditions under which it stops, flags, and hands off to a human. Agents that don’t have this will eventually make a mistake that erodes user trust permanently.
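An escalation path is easiest to enforce when it lives in code rather than in the prompt. The sketch below is illustrative only: `ProposedAction`, `escalation_reason`, and the 0.75 confidence floor are hypothetical names and values, not part of any library. It shows the three stop conditions described above (out of scope, irreversible, low confidence) as an explicit gate that runs before the agent acts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProposedAction:
    """An action the agent wants to take (hypothetical shape for illustration)."""
    name: str
    confidence: float   # calibrated or self-reported confidence in [0, 1]
    irreversible: bool  # e.g. refunds, deletions, outbound messages
    in_scope: bool      # whether the task matches the agent's defined job


CONFIDENCE_FLOOR = 0.75  # assumed value; tune it against your own eval data


def escalation_reason(action: ProposedAction) -> Optional[str]:
    """Return why the action must go to a human, or None if the agent may proceed."""
    if not action.in_scope:
        return "out_of_scope"
    if action.irreversible:
        return "irreversible_action"
    if action.confidence < CONFIDENCE_FLOOR:
        return "low_confidence"
    return None
```

Returning a reason string rather than a bare boolean means every handoff can be logged and counted by type, which tells you later whether the agent is escalating too much or too little.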


The 5-Step Production Build Process

Step 1: Define the Agent’s Scope Ruthlessly

Before you write a line of code, get brutal about what this agent does and what it explicitly does not do.

The most common scoping mistake is building an agent that does “a lot of related things.” Broad scope means broad failure surface. Every capability you add is another vector for unexpected behavior.

One job. One success metric.

For a customer support triage agent, the job is: classify the inbound ticket, match it to the right queue, draft a first-response candidate. The success metric is: percentage of tickets where the human agent edits the draft fewer than three times before sending.

Write both of those down before you start. If you can’t state them in two sentences, you’re not ready to build.

This discipline also forces the right scoping conversation with stakeholders. “We want the agent to handle customer inquiries” is a product fantasy. “We want the agent to reduce first-response time from 4 hours to under 20 minutes for tier-1 support tickets” is a scope you can build to and measure.
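A success metric is only useful if you can compute it from data you already collect. As a minimal sketch, assuming you log how many times a human edited each draft before sending, the support triage metric above reduces to a few lines. The function name `draft_acceptance_rate` and the input shape are assumptions for illustration.

```python
def draft_acceptance_rate(edit_counts: list[int], max_edits: int = 3) -> float:
    """Share of tickets whose draft was edited fewer than `max_edits` times."""
    if not edit_counts:
        raise ValueError("no tickets to score")
    accepted = sum(1 for n in edit_counts if n < max_edits)
    return accepted / len(edit_counts)
```

For example, edit counts of `[0, 1, 2, 3, 5]` give a rate of 0.6, because three of the five drafts needed fewer than three edits.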

Step 2: Design the Eval Framework Before Writing Code

This step feels backwards to most engineering teams and gets skipped constantly. Don’t skip it.

Designing your evaluation framework before writing the agent forces you to answer the question you’ll eventually have to answer anyway: What does success actually look like?

Start by collecting 50–100 real examples of the task the agent will handle. Label them. Define “correct” for each one — not abstractly, but concretely. For a data extraction agent: the right fields, populated correctly, in the right format. For a classification agent: the right label.

Then define your minimum acceptable performance threshold. Not a stretch goal — the floor below which you won’t ship. This number protects you from shipping something that looks functional but isn’t.

Build your ground-truth dataset and your scoring rubric before you write the agent. When you write the code, you’ll run against real eval criteria instead of optimizing for the impressiveness of your own test cases.
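For a classification agent, the ground-truth dataset, scoring rubric, and shipping floor can be sketched in a few lines. This is one possible shape under stated assumptions: `EvalCase`, `run_eval`, `passes_floor`, and the 0.90 floor are hypothetical, and exact-match scoring only fits tasks with a single correct label. Extraction or generation tasks need a richer rubric.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One labeled example from the ground-truth dataset."""
    input_text: str
    expected_label: str


MIN_ACCURACY = 0.90  # assumed floor; pick your own before writing the agent


def run_eval(agent: Callable[[str], str], cases: list[EvalCase]) -> float:
    """Score the agent by exact-match accuracy over the labeled cases."""
    if not cases:
        raise ValueError("eval set is empty")
    correct = sum(1 for case in cases if agent(case.input_text) == case.expected_label)
    return correct / len(cases)


def passes_floor(accuracy: float, floor: float = MIN_ACCURACY) -> bool:
    """True when the score meets the minimum acceptable threshold."""
    return accuracy >= floor
```

Because `agent` is just a callable from input text to label, the same harness can run against a stub today and the real agent later. Wiring `passes_floor` into a CI test is what turns the threshold into something that blocks a bad release.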

Step 3: Choose the Right LLM for Each Task Tier

Production agents almost always have more than one model doing work — and treating every task as equivalent is a fast path to either poor performance or unsustainable cost.

Think in tiers:

Tier 1 — High-volume, low-complexity tasks: Classification, routing, extraction, format validation. These run constantly. Use a smaller, faster, cheaper model — gpt-4o-mini, Claude Haiku, or an open-weights model you host yourself. Cost discipline here funds everything else.

Tier 2 — Medium-complexity reasoning: Drafting responses, synthesizing from multiple sources, structured analysis. This is where a mid-tier model earns its cost: gpt-4o, Claude Sonnet.

Tier 3 — High-stakes, low-frequency decisions: Complex judgment calls, multi-step planning, anything where a wrong answer has material consequences. Reserve your most capable (and most expensive) model for this tier, and run it rarely.

Document your model choice for each tier and the reasoning behind it. Model provider landscapes shift fast. You want to be able to re-evaluate these decisions in six months without rebuilding from scratch.
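One way to keep those decisions documented and swappable is to hold the tier mapping in a single table, with the rationale stored next to each choice. The sketch below is an assumption about how you might structure it: the task names, the rationale text, and the Tier 3 placeholder are hypothetical, and in a real system the table would more likely be loaded from configuration than defined in code.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    HIGH_VOLUME = 1   # classification, routing, extraction
    REASONING = 2     # drafting, synthesis, structured analysis
    HIGH_STAKES = 3   # rare decisions with material consequences


@dataclass(frozen=True)
class ModelChoice:
    model: str
    rationale: str    # why this model was chosen, kept next to the choice


# Hypothetical mapping; the Tier 3 entry is a placeholder, not a real model name.
MODEL_TIERS = {
    Tier.HIGH_VOLUME: ModelChoice("gpt-4o-mini", "runs constantly, cost-sensitive"),
    Tier.REASONING: ModelChoice("gpt-4o", "drafting quality justifies the cost"),
    Tier.HIGH_STAKES: ModelChoice("your-most-capable-model", "rare, expensive to get wrong"),
}

# Hypothetical task names for a support triage agent.
TASK_TIERS = {
    "classify_ticket": Tier.HIGH_VOLUME,
    "draft_reply": Tier.REASONING,
    "approve_refund": Tier.HIGH_STAKES,
}


def model_for(task: str) -> str:
    """Resolve a task to its model name; unknown tasks fail loudly."""
    tier = TASK_TIERS.get(task)
    if tier is None:
        raise KeyError(f"no tier assigned for task {task!r}")
    return MODEL_TIERS[tier].model
```

Application code asks for `model_for("draft_reply")` and never mentions a model name directly, so swapping a model touches one table entry instead of every call site.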

Step 4: Build the Memory Model

Agents need three distinct types of memory, and conflating them causes most of the architecture problems we see:

Ephemeral (in-context) memory: The current session state — what’s happened in this conversation, this task run, this execution. Lives in the context window. Cheap and immediate, but gone when the session ends. Use it for task-local state only.

Short-term (session) memory: State that needs to survive beyond a single context window but doesn’t need to be permanent. A running summary of a long interaction, intermediate results from a multi-step process, user preferences established earlier in a session. Typically stored in Redis or a lightweight key-value store with a TTL (time to live), so stale state expires on its own.

Long-term (persistent) memory: Facts about users, prior decisions, accumulated knowledge the agent should have access to indefinitely. Lives in a vector store (Pinecone, Weaviate, pgvector) and is retrieved via semantic search, or in a structured store queried by key. Design this carefully: decide what gets written here, when, and under what authority. Unbounded writes to long-term memory create their own reliability problems.

For most production agents, the most important design decision is: what information should not survive this session? Starting from that constraint forces the right trade-offs.
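The sketch below illustrates two of those decisions under stated assumptions: a session store whose entries expire after a TTL, and an allow-list that decides which session fields may be promoted to long-term memory. `SessionMemory` is an in-process stand-in for Redis or a similar store, and `PERSISTABLE_KEYS` and `promote_to_long_term` are hypothetical names. A production version would sit on a real store, not a Python dict.

```python
import time
from typing import Any, Callable, Optional


class SessionMemory:
    """Short-term store with per-key expiry (in-process stand-in for Redis with TTL)."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._items: dict[str, tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        """Store a value that expires ttl_seconds from now."""
        self._items[key] = (self._clock() + self._ttl, value)

    def get(self, key: str) -> Optional[Any]:
        """Return the value, or None if it is missing or has expired."""
        entry = self._items.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._items[key]  # drop expired state on read
            return None
        return value


# Hypothetical allow-list: only these fields may outlive the session.
PERSISTABLE_KEYS = frozenset({"preferred_language", "account_tier"})


def promote_to_long_term(session_state: dict[str, Any]) -> dict[str, Any]:
    """Keep only the session fields approved for long-term memory."""
    return {k: v for k, v in session_state.items() if k in PERSISTABLE_KEYS}
```

Using an allow-list rather than a block-list makes forgetting the default: anything not explicitly approved for persistence disappears when the session does.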

Step 5: Wire in Observability from Day One

Observability is not a post-launch concern. If you ship a production agent without trace logging, you have no mechanism to debug failures, no data to improve from, and no evidence to share with stakeholders when they ask whether it’s working.

The minimum viable observability setup:

  • Full trace logging: Every tool call, every LLM invocation, inputs and outputs, latency, and token counts per step. LangSmith and Helicone are both solid options depending on your stack.
  • Cost per run: Calculated per agent execution, aggregated daily. This is the number that will surprise you at scale.
  • Latency distribution: P50, P95, P99. The average hides the tail latency that users actually experience.
  • Error rate by failure type: Distinguish tool call failures from LLM refusals from schema validation errors. They have different remediation paths.
  • Eval scores in production: Run your scoring rubric against a sample of live production outputs, not just your benchmark dataset. Drift shows up here first.

Set up alerts for cost spikes and error rate increases before you launch. The goal isn’t a dashboard — it’s a system that tells you when something’s wrong before your users do.
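If you are not ready to adopt a tracing product, the numbers in the list above can still be computed from plain per-step records. The sketch below is a minimal illustration, not a replacement for a real tracing stack: `StepTrace` is a hypothetical record shape, the per-token prices are made-up placeholders, and `percentile` uses the simple nearest-rank method.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StepTrace:
    """One tool call or LLM invocation within an agent run (hypothetical shape)."""
    run_id: str
    step: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    error_type: Optional[str] = None  # e.g. "tool_failure", "llm_refusal", "schema_error"


# Placeholder prices in USD per million tokens; substitute your provider's real rates.
PRICE_PER_M_INPUT = 1.00
PRICE_PER_M_OUTPUT = 4.00


def cost_per_run(traces: list[StepTrace]) -> dict[str, float]:
    """Sum token cost across all steps of each run."""
    costs: dict[str, float] = {}
    for t in traces:
        step_cost = (
            t.input_tokens * PRICE_PER_M_INPUT + t.output_tokens * PRICE_PER_M_OUTPUT
        ) / 1_000_000
        costs[t.run_id] = costs.get(t.run_id, 0.0) + step_cost
    return costs


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def error_counts(traces: list[StepTrace]) -> Counter:
    """Count failed steps by failure type."""
    return Counter(t.error_type for t in traces if t.error_type is not None)
```

With latencies of 100, 120, 130, 150, and 2000 ms, the mean is 500 ms and the median is 130 ms, but the nearest-rank P95 is 2000 ms. That gap is the tail the average hides.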


Common Shortcuts That Kill Agents in Production

These are the decisions that look fine in the short term and create serious problems within a few months:

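  • Skipping golden trace regression tests. A golden trace is a recorded known-good run whose tool calls you replay and compare after every change. Without them, you’ll break tool call behavior with a prompt update and not find out until production degrades.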
  • Using a single system prompt for everything. Different task types need different instructions. One massive system prompt creates conflicting behavior.
  • No rate limiting on tool calls. An agent in a bad reasoning loop can make thousands of API calls before anyone notices. Set hard limits.
  • Hardcoding model names in your application code. When you need to swap models — and you will — this becomes a refactor instead of a config change.
  • No escalation path defined. “The agent will figure it out” is not an edge case strategy.
  • Building without a staging environment. Testing prompt changes in production because “it’s just a prompt” is how you cause incidents.
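  • Over-indexing on benchmark performance. A 92% score on your eval suite means roughly 8% of runs fail in production, and that assumes the suite is representative of live traffic. At 10,000 runs a day, for example, that is about 800 failures a day. Know what that number means for your users.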
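Of the items in that list, the hard limit on tool calls is the quickest to add. A minimal sketch, assuming one budget object is created per agent run and charged before every tool invocation, might look like the following. `ToolCallBudget` and `ToolBudgetExceeded` are hypothetical names, not part of any agent framework.

```python
class ToolBudgetExceeded(RuntimeError):
    """Raised when an agent run tries to exceed its tool-call limit."""


class ToolCallBudget:
    """Hard cap on tool calls for a single agent run."""

    def __init__(self, max_calls: int):
        if max_calls < 1:
            raise ValueError("max_calls must be at least 1")
        self.max_calls = max_calls
        self.used = 0

    def charge(self, tool_name: str) -> None:
        """Record one call, or raise if the run has already spent its budget."""
        if self.used >= self.max_calls:
            raise ToolBudgetExceeded(
                f"refused call to {tool_name!r}: limit of {self.max_calls} calls reached"
            )
        self.used += 1
```

Raising an exception instead of silently dropping the call forces the run to stop, which gives a runaway reasoning loop a natural point to hand off to a human.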

Moving from Prototype to Production

The teams that ship durable production agents have one thing in common: they treat the agent as a software system with all the engineering rigor that implies — evaluation, observability, architecture decisions documented, failure modes anticipated. Not a demo you scaled up.

That transition from prototype to production is exactly the problem our Diagnostic Sprint is designed to solve. In 2 weeks, we assess your current architecture against these failure modes, identify the gaps, and deliver a clear path to a production-ready system your team can own and extend.

If you’re sitting on a prototype that works in staging but haven’t been able to get confident enough to ship it, that’s the conversation to start.

Talk to us about the Diagnostic Sprint →


Agentic Runbook designs, builds, and transfers agentic AI systems for mid-market engineering teams. Start with a Diagnostic Sprint →
