Agents in Production: What Works, What Breaks, What We Learned
10/8/2025 • AI Systems • 10 min read
The Reality of Agents in Production
Most agent demos look incredible.
Most production agents fail instantly.
Why?
Because demos are *controlled environments*.
Production environments are the exact opposite.
Agents break when:
• instructions conflict
• tools fail
• APIs lag
• output structure shifts
• tasks require judgment instead of pattern matching
Let’s break down what actually works — and what definitely doesn’t.
---
What Breaks First
1. Multi-step tasks
Agents compound errors.
Step 1 is wrong → step 2 is wrong → step 3 is garbage.
2. Tool usage
APIs return weird edge-case responses that break the agent loop.
3. Long context windows
Agents forget instructions halfway through if prompt architecture is weak.
4. Hallucinated tool calls
They “call” tools that don’t exist or mix up parameters (see the validation sketch after this list).
5. Unbounded recursion
Agents keep retrying something that will never succeed.
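
The hallucinated-tool-call failure is cheap to catch at the boundary, before it poisons the loop. A minimal sketch, assuming a plain dict-based tool registry (the registry and the `fetch_invoice` tool are made up for illustration):

```python
# Minimal sketch: catch hallucinated tool calls before they break the loop.
# TOOL_REGISTRY and fetch_invoice are hypothetical, not from any framework.
import inspect

def fetch_invoice(invoice_id: str) -> dict:
    return {"invoice_id": invoice_id, "status": "paid"}

TOOL_REGISTRY = {"fetch_invoice": fetch_invoice}

def validate_tool_call(name: str, args: dict) -> str | None:
    """Return an error message the agent can read, or None if the call is valid."""
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        return f"Unknown tool '{name}'. Available: {sorted(TOOL_REGISTRY)}"
    expected = set(inspect.signature(tool).parameters)
    unexpected = set(args) - expected
    missing = expected - set(args)
    if unexpected or missing:
        return f"Bad arguments for '{name}': unexpected={sorted(unexpected)}, missing={sorted(missing)}"
    return None

print(validate_tool_call("fetch_invoices", {"invoice_id": "A-17"}))  # unknown tool name
print(validate_tool_call("fetch_invoice", {"id": "A-17"}))           # wrong parameter name
```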
---
What Actually Works
🔹 1. Deterministic Tools
Every tool should return:
• stable schema
• strict types
• descriptive errors
• minimal ambiguity
No “sometimes returns a list, sometimes returns an object” nonsense.
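
What that looks like in practice: one fixed result type, every field typed, errors as readable strings. A minimal sketch using a plain dataclass; the field names and the `search_orders` tool are illustrative, not from any specific framework:

```python
# Minimal sketch of a deterministic tool result: one stable schema, strict types,
# descriptive errors. Field names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class ToolResult:
    ok: bool
    data: list[dict] = field(default_factory=list)  # always a list, never "sometimes an object"
    error: str = ""                                  # empty when ok, human-readable otherwise

def search_orders(customer_id: str) -> ToolResult:
    if not customer_id:
        return ToolResult(ok=False, error="customer_id must be a non-empty string")
    # ... real lookup would go here ...
    return ToolResult(ok=True, data=[{"order_id": "o-123", "total_cents": 4900}])

print(search_orders(""))  # ToolResult(ok=False, data=[], error='customer_id must be a non-empty string')
```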
---
🔹 2. Strict Observability
Production agents require:
• event logs
• tool-call traces
• step-by-step timelines
• agent-state snapshots
• output validation
Agents without monitoring are black boxes.
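
The cheapest version of this is a structured, append-only event log keyed by run ID. A minimal sketch; the event names and fields here are assumptions, not a standard:

```python
# Minimal sketch: structured event log for one agent run.
# Event names and payload fields are illustrative assumptions.
import json
import time
import uuid

class AgentTrace:
    def __init__(self) -> None:
        self.run_id = str(uuid.uuid4())
        self.events: list[dict] = []

    def log(self, event: str, **payload) -> None:
        record = {"run_id": self.run_id, "ts": time.time(), "event": event, **payload}
        self.events.append(record)
        print(json.dumps(record))  # ship to your real log pipeline instead of stdout

trace = AgentTrace()
trace.log("step_started", step=1, goal="summarize unpaid invoices")
trace.log("tool_called", tool="fetch_invoice", args={"invoice_id": "A-17"})
trace.log("output_validated", valid=True)
```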
---
🔹 3. Guarded Autonomy
Limit the agent:
• max steps
• max tokens
• specific allowed tools
• clear stop conditions
• user confirmation checkpoints
Full autonomy is fiction.
Scoped autonomy is production.
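
In code, scoped autonomy is mostly a bounded loop. A minimal sketch, where `plan_next_action` and `execute` are hypothetical callables standing in for your model call and your tool layer:

```python
# Minimal sketch of a bounded agent loop: hard step limit, explicit stop condition,
# and a confirmation checkpoint before risky actions. All names are placeholders.
MAX_STEPS = 8
RISKY_ACTIONS = {"send_email", "delete_record"}

def run_agent(goal: str, plan_next_action, execute) -> str:
    for step in range(MAX_STEPS):
        # plan_next_action returns e.g. {"tool": "...", "args": {...}} or {"done": "summary"}
        action = plan_next_action(goal, step)
        if "done" in action:
            return action["done"]                      # clear stop condition
        if action["tool"] in RISKY_ACTIONS:
            if input(f"Allow {action['tool']}? [y/N] ").lower() != "y":
                return "stopped: user rejected a risky action"
        execute(action["tool"], action["args"])
    return "stopped: hit MAX_STEPS without finishing"
```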
---
🔹 4. Routing Models
A small model handles routine reasoning steps → a large model handles planning and high-stakes decisions.
This reduces costs **and** makes agents more stable.
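
In practice this is just a router in front of two endpoints. A minimal sketch; the model names, the keyword heuristic, and `call_llm` are placeholders, not any provider's API:

```python
# Minimal sketch of model routing: cheap model for routine steps, large model for planning.
# Model names and the keyword heuristic are illustrative assumptions.
SMALL_MODEL = "small-fast-model"
LARGE_MODEL = "large-planner-model"

def pick_model(task: str) -> str:
    planning_markers = ("plan", "decide", "prioritize", "multi-step")
    return LARGE_MODEL if any(m in task.lower() for m in planning_markers) else SMALL_MODEL

def call_llm(model: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your provider's client")

print(pick_model("Plan the quarter-end close as a multi-step checklist"))  # -> large-planner-model
print(pick_model("Extract the invoice number from this email"))           # -> small-fast-model
```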
---
🔹 5. Human-in-the-Loop
The best production agents still rely on humans to:
• approve critical decisions
• validate ambiguous outputs (see the sketch after this list)
• override failures
• retrain or adjust prompts
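
A minimal sketch of the "validate ambiguous outputs" checkpoint: anything below a confidence threshold goes to a review queue instead of being applied automatically. The threshold and queue are illustrative assumptions:

```python
# Minimal sketch: flag ambiguous agent outputs for human review instead of auto-applying them.
# The threshold value and queue shape are illustrative assumptions.
REVIEW_THRESHOLD = 0.8
review_queue: list[dict] = []

def handle_output(result: str, confidence: float, apply) -> None:
    if confidence >= REVIEW_THRESHOLD:
        apply(result)                                                       # safe to automate
    else:
        review_queue.append({"result": result, "confidence": confidence})   # a human validates later

handle_output("Refund order o-123 in full", confidence=0.55, apply=print)
print(review_queue)  # -> one item waiting for human approval
```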
Agents aren’t replacing humans —
they’re replacing **busywork**.
---
Key Takeaway
Agents aren’t “AI employees.”
They are **automation workflows with LLM reasoning glued in**.
Treat them like systems — not magic — and they’ll actually work.

